1. 13 Oct 2021, 1 commit
  2. 15 Jun 2021, 1 commit
    • tipc: skb_linearize the head skb when reassembling msgs · 29102688
      Committed by Xin Long
      stable inclusion
      from stable-5.10.42
      commit 6da24cfc83ba4f97ea44fc7ae9999a006101755c
      bugzilla: 55093
      CVE: NA
      
      --------------------------------
      
      commit b7df21cf upstream.
      
      It's not a good idea to append the frag skb to a skb's frag_list if
      the frag_list already has skbs from elsewhere, for example when this
      skb was created by pskb_copy(), where the frag_list was cloned (all
      the skbs in it were skb_get'ed) and is shared by multiple skbs.

      The newly appended frag skb should be visible to the current skb
      only. Otherwise it causes use-after-free crashes, because the
      appended frag skb is referenced by multiple skbs while skb_get()
      was called on it only once.
      
      The same thing happens with a skb updated by pskb_may_pull() with a
      skb_cloned skb. Li Shuang has reported quite a few crashes caused
      by this when doing testing over macvlan devices:
      
        [] kernel BUG at net/core/skbuff.c:1970!
        [] Call Trace:
        []  skb_clone+0x4d/0xb0
        []  macvlan_broadcast+0xd8/0x160 [macvlan]
        []  macvlan_process_broadcast+0x148/0x150 [macvlan]
        []  process_one_work+0x1a7/0x360
        []  worker_thread+0x30/0x390
      
        [] kernel BUG at mm/usercopy.c:102!
        [] Call Trace:
        []  __check_heap_object+0xd3/0x100
        []  __check_object_size+0xff/0x16b
        []  simple_copy_to_iter+0x1c/0x30
        []  __skb_datagram_iter+0x7d/0x310
        []  __skb_datagram_iter+0x2a5/0x310
        []  skb_copy_datagram_iter+0x3b/0x90
        []  tipc_recvmsg+0x14a/0x3a0 [tipc]
        []  ____sys_recvmsg+0x91/0x150
        []  ___sys_recvmsg+0x7b/0xc0
      
        [] kernel BUG at mm/slub.c:305!
        [] Call Trace:
        []  <IRQ>
        []  kmem_cache_free+0x3ff/0x400
        []  __netif_receive_skb_core+0x12c/0xc40
        []  ? kmem_cache_alloc+0x12e/0x270
        []  netif_receive_skb_internal+0x3d/0xb0
        []  ? get_rx_page_info+0x8e/0xa0 [be2net]
        []  be_poll+0x6ef/0xd00 [be2net]
        []  ? irq_exit+0x4f/0x100
        []  net_rx_action+0x149/0x3b0
      
        ...
      
      This patch fixes it by linearizing the head skb in tipc_buf_append()
      if it has a frag_list set. Note that we choose to do this before
      calling skb_unshare(), as __skb_linearize() avoids the skb_copy().
      Also, we cannot simply drop the frag_list this early either.
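      A minimal sketch of the first-fragment handling described above, with
      names borrowed from tipc_buf_append(); it is illustrative rather than
      the exact upstream diff:

        if (fragid == FIRST_FRAGMENT) {
                if (unlikely(head))
                        goto err;
                /* If the head skb already carries a frag_list (e.g. it came
                 * through pskb_copy() and the list is shared), linearize it
                 * first, so appending later fragments never modifies a
                 * shared frag_list. */
                if (skb_has_frag_list(frag) && __skb_linearize(frag))
                        goto err;
                frag = skb_unshare(frag, GFP_ATOMIC);
                if (unlikely(!frag))
                        goto err;
                head = *headbuf = frag;
        }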
      
      Fixes: 45c8b7b1 ("tipc: allow non-linear first fragment buffer")
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      29102688
  3. 30 Oct 2020, 1 commit
    • tipc: fix memory leak caused by tipc_buf_append() · ceb1eb2f
      Committed by Tung Nguyen
      Commit ed42989e ("tipc: fix the skb_unshare() in tipc_buf_append()")
      replaced skb_unshare() with skb_copy() so as to intentionally not
      reduce the data reference counter of the original skb. This is not
      the correct way to handle the cloned skb, because it causes a memory
      leak in the two following cases:
       1/ Sending multicast messages via broadcast link
        The original skb list is cloned to the local skb list for local
        destination. After that, the data reference counter of each skb
        in the original list has the value of 2. This causes each skb not
        to be freed after receiving ACK:
        tipc_link_advance_transmq()
        {
         ...
         /* release skb */
         __skb_unlink(skb, &l->transmq);
         kfree_skb(skb); <-- memory exists after being freed
        }
      
       2/ Sending multicast messages via replicast link
        Similar to the above case, each skb cannot be freed after purging
        the skb list:
        tipc_mcast_xmit()
        {
         ...
         __skb_queue_purge(pkts); <-- memory exists after being freed
        }
      
      This commit fixes the issue by using skb_unshare() instead. Besides,
      to avoid the use-after-free error reported by KASAN, the caller's
      pointer to the fragment is set to NULL before calling skb_unshare(),
      so that the original skb is not freed a second time in case
      skb_unshare() returns NULL.
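      A small sketch of that ordering, with names borrowed from
      tipc_buf_append() (illustrative, not the exact upstream hunk):

        /* Clear the caller's pointer before skb_unshare(): if skb_unshare()
         * fails it has already freed 'frag', and the caller must not free
         * the same buffer again through *buf. */
        *buf = NULL;
        frag = skb_unshare(frag, GFP_ATOMIC);
        if (unlikely(!frag))
                goto err;
        head = *headbuf = frag;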
      
      Fixes: ed42989e ("tipc: fix the skb_unshare() in tipc_buf_append()")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Reported-by: Thang Hoang Ngo <thang.h.ngo@dektech.com.au>
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ceb1eb2f
  4. 10 Oct 2020, 1 commit
    • tipc: fix the skb_unshare() in tipc_buf_append() · ed42989e
      Committed by Cong Wang
      skb_unshare() drops a reference count on the old skb unconditionally,
      so in the failure case we end up freeing the skb twice here. And
      because the skb is allocated as an fclone and cloned by the caller
      tipc_msg_reassemble(), the consequence is that the original skb is
      freed too, thus triggering the UAF reported by syzbot.
      
      Fix this by replacing this skb_unshare() with skb_cloned()+skb_copy().
      
      Fixes: ff48b622 ("tipc: use skb_unshare() instead in tipc_buf_append()")
      Reported-and-tested-by: syzbot+e96a7ba46281824cc46a@syzkaller.appspotmail.com
      Cc: Jon Maloy <jmaloy@redhat.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ed42989e
  5. 19 Sep 2020, 1 commit
  6. 15 Sep 2020, 1 commit
    • tipc: use skb_unshare() instead in tipc_buf_append() · ff48b622
      Committed by Xin Long
      In tipc_buf_append() we may change the skb's frag_list, and this
      causes problems when the skb is cloned. skb_unclone() doesn't really
      make the skb's frag_list safe to change.

      Shuang Li has reported a use-after-free issue caused by this when
      creating quite a few macvlan devices over the same device, where
      the broadcast packets are cloned and go up the stack:
      
       [ ] BUG: KASAN: use-after-free in pskb_expand_head+0x86d/0xea0
       [ ] Call Trace:
       [ ]  dump_stack+0x7c/0xb0
       [ ]  print_address_description.constprop.7+0x1a/0x220
       [ ]  kasan_report.cold.10+0x37/0x7c
       [ ]  check_memory_region+0x183/0x1e0
       [ ]  pskb_expand_head+0x86d/0xea0
       [ ]  process_backlog+0x1df/0x660
       [ ]  net_rx_action+0x3b4/0xc90
       [ ]
       [ ] Allocated by task 1786:
       [ ]  kmem_cache_alloc+0xbf/0x220
       [ ]  skb_clone+0x10a/0x300
       [ ]  macvlan_broadcast+0x2f6/0x590 [macvlan]
       [ ]  macvlan_process_broadcast+0x37c/0x516 [macvlan]
       [ ]  process_one_work+0x66a/0x1060
       [ ]  worker_thread+0x87/0xb10
       [ ]
       [ ] Freed by task 3253:
       [ ]  kmem_cache_free+0x82/0x2a0
       [ ]  skb_release_data+0x2c3/0x6e0
       [ ]  kfree_skb+0x78/0x1d0
       [ ]  tipc_recvmsg+0x3be/0xa40 [tipc]
      
      So fix it by using skb_unshare() instead, which creates a new skb for
      the cloned frag, making it safe to change its frag_list. A similar
      thing is done in sctp_make_reassembled_event(), which uses
      skb_copy().
      Reported-by: Shuang Li <shuali@redhat.com>
      Fixes: 37e22164 ("tipc: rename and move message reassembly function")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff48b622
  7. 14 Jul 2020, 1 commit
  8. 12 Jun 2020, 1 commit
    • tipc: fix kernel WARNING in tipc_msg_append() · c9aa81fa
      Committed by Tuong Lien
      syzbot found the following issue:
      
      WARNING: CPU: 0 PID: 6808 at include/linux/thread_info.h:150 check_copy_size include/linux/thread_info.h:150 [inline]
      WARNING: CPU: 0 PID: 6808 at include/linux/thread_info.h:150 copy_from_iter include/linux/uio.h:144 [inline]
      WARNING: CPU: 0 PID: 6808 at include/linux/thread_info.h:150 tipc_msg_append+0x49a/0x5e0 net/tipc/msg.c:242
      Kernel panic - not syncing: panic_on_warn set ...
      
      This happens after commit 5e9eeccc ("tipc: fix NULL pointer
      dereference in streaming"), which tried to build at least one buffer
      even when the message data length is zero... However, it exposes
      another bug: 'mss' can be zero, and 'cpy' then becomes negative, so
      the kernel WARNING above appears!
      A zero 'mss' is never expected, because it means Nagle is not enabled
      for the socket (the socket type here was actually 'SOCK_SEQPACKET'),
      so 'tipc_msg_append()' should not be called at all. But it was called
      in this particular case, because the message data length was zero and
      the 'send <= maxnagle' check therefore became true.
      
      We resolve the issue by explicitly checking whether Nagle is enabled
      for the socket, i.e. 'maxnagle != 0', before calling
      'tipc_msg_append()'. We also harden the function against such
      negative values, should they ever occur.
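      A tiny user-space illustration of the arithmetic problem (the names
      mirror the kernel code, but the snippet only demonstrates why the
      copy length goes wrong when 'mss' is zero):

        #include <stdio.h>
        #include <stddef.h>

        int main(void)
        {
                int mss = 0;    /* Nagle disabled: no bundling size limit */
                int mlen = 24;  /* bytes already occupied by the header   */

                int cpy = mss - mlen;                  /* -24: already bogus       */
                size_t as_len = (size_t)(mss - mlen);  /* huge if used as a length */

                printf("cpy = %d, as unsigned length = %zu\n", cpy, as_len);
                return 0;
        }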
      
      Reported-by: syzbot+75139a7d2605236b0b7f@syzkaller.appspotmail.com
      Fixes: c0bceb97 ("tipc: add smart nagle feature")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9aa81fa
  9. 05 Jun 2020, 1 commit
    • tipc: fix NULL pointer dereference in streaming · 5e9eeccc
      Committed by Tuong Lien
      syzbot found the following crash:
      
      general protection fault, probably for non-canonical address 0xdffffc0000000019: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
      CPU: 1 PID: 7060 Comm: syz-executor394 Not tainted 5.7.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__tipc_sendstream+0xbde/0x11f0 net/tipc/socket.c:1591
      Code: 00 00 00 00 48 39 5c 24 28 48 0f 44 d8 e8 fa 3e db f9 48 b8 00 00 00 00 00 fc ff df 48 8d bb c8 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 e2 04 00 00 48 8b 9b c8 00 00 00 48 b8 00 00 00
      RSP: 0018:ffffc90003ef7818 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8797fd9d
      RDX: 0000000000000019 RSI: ffffffff8797fde6 RDI: 00000000000000c8
      RBP: ffff888099848040 R08: ffff88809a5f6440 R09: fffffbfff1860b4c
      R10: ffffffff8c305a5f R11: fffffbfff1860b4b R12: ffff88809984857e
      R13: 0000000000000000 R14: ffff888086aa4000 R15: 0000000000000000
      FS:  00000000009b4880(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000140 CR3: 00000000a7fdf000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tipc_sendstream+0x4c/0x70 net/tipc/socket.c:1533
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:672
       ____sys_sendmsg+0x32f/0x810 net/socket.c:2352
       ___sys_sendmsg+0x100/0x170 net/socket.c:2406
       __sys_sendmmsg+0x195/0x480 net/socket.c:2496
       __do_sys_sendmmsg net/socket.c:2525 [inline]
       __se_sys_sendmmsg net/socket.c:2522 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2522
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x440199
      ...
      
      This bug was bisected to commit 0a3e060f ("tipc: add test for Nagle
      algorithm effectiveness"). However, that commit is not the real
      cause; the trouble was already in the base code: when sending a
      message with zero data length in Nagle mode, we would unexpectedly
      end up with an empty 'txq' queue after 'tipc_msg_append()'.

      A similar crash can be generated even without the bisected patch,
      but at the link layer, when it accesses the empty queue.
      
      We solve the issue by always building at least one buffer carrying
      the socket header and an optional, possibly empty, data section,
      just as we had with 'tipc_msg_build()'.
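      A rough sketch of that 'at least one buffer' idea, assuming a
      tipc_msg_append()-style loop (the names are borrowed from the kernel
      code, and no_space_left() is a hypothetical stand-in for the real
      space check; this is a sketch, not the actual patch):

        do {
                skb = skb_peek_tail(txq);
                if (!skb || no_space_left(skb)) {
                        skb = tipc_buf_acquire(mss, GFP_KERNEL);
                        if (!skb)
                                return -ENOMEM;
                        /* initialize the socket header here */
                        __skb_queue_tail(txq, skb);
                }
                /* copy min(rem, space) bytes of user data into skb */
                rem -= copied;
        } while (rem);  /* do/while: dlen == 0 still yields one buffer */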
      
      Note: the previous commit 4c21daae ("tipc: Fix NULL pointer
      dereference in __tipc_sendstream()") is obsoleted by this one, since
      'txq' will never be empty and the 'skb != NULL' check becomes
      unnecessary, though it is still safe.
      
      Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
      Fixes: c0bceb97 ("tipc: add smart nagle feature")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e9eeccc
  10. 30 May 2020, 1 commit
  11. 27 May 2020, 2 commits
    • tipc: add test for Nagle algorithm effectiveness · 0a3e060f
      Committed by Tuong Lien
      When streaming in Nagle mode, we try to bundle as many small user
      messages as possible while there is one outstanding buffer, i.e. one
      not yet ACK-ed by the receiving side, which helps boost the overall
      throughput. The algorithm's effectiveness therefore depends on when
      the Nagle ACK arrives and on the network latency (RTT), compared to
      the user's message sending rate.
      
      In a bad case, where the user's sending rate is low or the network
      latency is small, there will not be many bundles, so generating a
      Nagle ACK or waiting for one is not meaningful.
      For example: if a user sends a message every 100ms and the RTT is
      50ms, we require one Nagle ACK for each message, yet each message
      still goes out on its own, without any bundling.
      
      Even in a better case, where we do get a few bundles (e.g. RTT =
      300ms), if the user sends medium-sized messages there is no
      difference at all; for instance, 3 x 1000-byte data messages, even
      if bundled, still result in 3 bundles with MTU = 1500.
      
      When Nagle is ineffective, the delay added to user message sending
      is simply wasted compared to sending directly.
      
      Besides, adding Nagle ACKs will consume some processor load on both the
      sending and receiving sides.
      
      This commit adds a test of the Nagle algorithm's effectiveness for an
      individual connection on the network where it actually runs. In
      particular, upon receipt of a Nagle ACK we compare the number of
      bundles in the backlog queue to the number of user messages that
      would have been sent directly without Nagle. If the ratio is good
      (e.g. >= 2), Nagle mode is kept for further message sending.
      Otherwise, we leave Nagle and put a 'penalty' on the connection, so
      it will have to send more 'one-way' messages before being able to
      re-enter Nagle.

      In addition, the 'ack-required' bit is only set when really needed,
      so the number of Nagle ACKs is reduced while in Nagle mode.
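      A minimal sketch of that effectiveness check, assuming hypothetical
      per-connection counters ('msg_acc' for accumulated user messages,
      'pkt_cnt' for packets actually sent) and a doubling penalty; the real
      kernel fields and thresholds may differ:

        #include <stdbool.h>

        #define NAGLE_START_INIT 4
        #define NAGLE_START_MAX  1024

        struct conn_state {
                bool expect_ack;          /* a Nagle ACK is outstanding      */
                unsigned int nagle_start; /* msgs to send before re-entering */
                unsigned int msg_acc;     /* user msgs accumulated in Nagle  */
                unsigned int pkt_cnt;     /* packets actually sent in Nagle  */
        };

        /* Called on receipt of a Nagle ACK: keep Nagle only if bundling paid off. */
        static void nagle_ack_received(struct conn_state *c)
        {
                bool effective = c->pkt_cnt && (c->msg_acc / c->pkt_cnt) >= 2;

                if (!effective) {
                        /* Leave Nagle and penalize: more un-bundled messages are
                         * required before the connection may re-enter Nagle mode. */
                        c->nagle_start = c->nagle_start * 2 > NAGLE_START_MAX ?
                                         NAGLE_START_MAX : c->nagle_start * 2;
                }
                c->expect_ack = false;
                c->msg_acc = 0;
                c->pkt_cnt = 0;
        }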
      
      Benchmark testing showed that with the patch there was not much
      difference in throughput for small messages, since the tool sends
      messages continuously without a break, so Nagle still takes effect.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a3e060f
    • tipc: add support for broadcast rcv stats dumping · 03b6fefd
      Committed by Tuong Lien
      This commit enables dumping the statistics of a broadcast-receiver link
      like the traditional 'broadcast-link' one (which is for broadcast-
      sender). The link dumping can be triggered via netlink (e.g. the
      iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
      indicator.
      
      The name of a broadcast-receiver link of a specific peer will be in the
      format: 'broadcast-link:<peer-id>'.
      
      For example:
      
      Link <broadcast-link:1001002>
        Window:50 packets
        RX packets:7841 fragments:2408/440 bundles:0/0
        TX packets:0 fragments:0/0 bundles:0/0
        RX naks:0 defs:124 dups:0
        TX naks:21 acks:0 retrans:0
        Congestion link:0  Send queue max:0 avg:0
      
      In addition, the broadcast-receiver link statistics can be reset in the
      usual way via netlink by specifying that link name in command.
      
      Note: the 'tipc_link_name_ext()' is removed because the link name can
      now be retrieved simply via the 'l->name'.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03b6fefd
  12. 15 Mar 2020, 1 commit
  13. 09 Nov 2019, 1 commit
    • tipc: introduce TIPC encryption & authentication · fc1b6d6d
      Committed by Tuong Lien
      This commit offers an option to encrypt and authenticate all messaging,
      including the neighbor discovery messages. The currently most advanced
      algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
      encryption/decryption is done at the bearer layer, just before leaving
      or after entering TIPC.
      
      Supported features:
      - Encryption & authentication of all TIPC messages (header + data);
      - Two symmetric-key modes: Cluster and Per-node;
      - Automatic key switching;
      - Key-expired revoking (sequence number wrapped);
      - Lock-free encryption/decryption (RCU);
      - Asynchronous crypto, Intel AES-NI supported;
      - Multiple cipher transforms;
      - Logs & statistics;
      
      Two key modes:
      - Cluster key mode: One single key is used for both TX & RX in all
      nodes in the cluster.
      - Per-node key mode: Each node in the cluster has one specific TX key.
      For RX, a node requires its peers' TX key to be able to decrypt the
      messages from those peers.
      
      Key setting from user-space is performed via netlink by a user program
      (e.g. the iproute2 'tipc' tool).
      
      Internal key state machine:
      
                                       Attach    Align(RX)
                                           +-+   +-+
                                           | V   | V
              +---------+      Attach     +---------+
              |  IDLE   |---------------->| PENDING |(user = 0)
              +---------+                 +---------+
                 A   A                   Switch|  A
                 |   |                         |  |
                 |   | Free(switch/revoked)    |  |
           (Free)|   +----------------------+  |  |Timeout
                 |              (TX)        |  |  |(RX)
                 |                          |  |  |
                 |                          |  v  |
              +---------+      Switch     +---------+
              | PASSIVE |<----------------| ACTIVE  |
              +---------+       (RX)      +---------+
              (user = 1)                  (user >= 1)
      
      The number of TFMs is 10 by default and can be changed via the procfs
      'net/tipc/max_tfms'. At the moment, for simplicity, this file is
      also used to print the crypto statistics at runtime:
      
      echo 0xfff1 > /proc/sys/net/tipc/max_tfms
      
      The patch defines a new TIPC version (v7) for the encryption message
      (with backward compatibility as well). The message is basically
      encapsulated as follows:
      
         +----------------------------------------------------------+
         | TIPCv7 encryption  | Original TIPCv2    | Authentication |
         | header             | packet (encrypted) | Tag            |
         +----------------------------------------------------------+
      
      The throughput is about ~40% for small messages (compared with
      non-encryption) and ~9% for large messages. With support from
      hardware crypto, i.e. the Intel AES-NI CPU instructions, the
      throughput increases up to ~85% for small messages and ~55% for
      large messages.
      
      By default, the new feature is inactive (i.e. no encryption) until user
      sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
      in the kernel configuration to enable/disable the new code when needed.
      
      MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc1b6d6d
  14. 04 Nov 2019, 1 commit
    • tipc: improve message bundling algorithm · 06e7c70c
      Committed by Tuong Lien
      As mentioned in commit e95584a8 ("tipc: fix unlimited bundling of
      small messages"), the current message bundling algorithm is
      inefficient in that it can generate bundles of only one payload
      message, which causes unnecessary overhead for both the sender and
      the receiver.
      
      This commit re-designs the 'tipc_msg_make_bundle()' function (now
      named 'tipc_msg_try_bundle()'), so that when the first message
      arrives we just check it and keep a reference to it if it is suitable
      for bundling. The message buffer is put into the link backlog queue
      and processed as normal. Later, when another message arrives, we make
      a bundle with the first one if possible, and so on... This way, a
      bundle, when really needed, always consists of at least two payload
      messages. Otherwise, we let the first buffer go its way without any
      bundling, reducing the overhead to zero.
      
      Moreover, since we now have both messages in hand, we can even
      optimize the 'tipc_msg_bundle()' function and bundle a very large
      message (size ~ MSS) with a small one, which is not possible with the
      current algorithm, e.g. [1400-byte message] + [10-byte message]
      (MTU = 1500).
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06e7c70c
  15. 31 Oct 2019, 1 commit
    • tipc: add smart nagle feature · c0bceb97
      Committed by Jon Maloy
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue is transmitted when one of
      the following events or conditions occur.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
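      A compact sketch of the flush decision implied by the conditions
      above; all field names here (in_nagle, queue_1k_blocks, etc.) are
      illustrative, not the actual kernel fields:

        #include <stdbool.h>

        struct nagle_state {
                bool in_nagle;                /* socket currently in nagle mode       */
                bool expect_ack;              /* an 'ack_required' msg is outstanding */
                bool got_conn_ack;            /* CONN_ACK just received from peer     */
                bool got_data;                /* data message just received from peer */
                bool sock_wakeup;             /* SOCK_WAKEUP received from link level */
                bool shutting_down;
                unsigned int queue_1k_blocks; /* 1k blocks in the write queue         */
        };

        /* Should the accumulated write queue be transmitted now? */
        static bool should_flush(const struct nagle_state *s)
        {
                if (!s->in_nagle)
                        return true;               /* no accumulation at all */
                if (s->got_conn_ack || s->got_data || s->sock_wakeup)
                        return true;
                if (s->queue_1k_blocks > 64 || s->shutting_down)
                        return true;
                if (!s->expect_ack)
                        return true;               /* nothing to wait for    */
                return false;
        }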
      
      This new feature gives a 50-100% throughput improvement for small
      (i.e., less than MTU-size) messages, while it may add up to one RTT
      of latency when the socket is in nagle mode.
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0bceb97
  16. 02 Oct 2019, 1 commit
    • tipc: fix unlimited bundling of small messages · e95584a8
      Committed by Tuong Lien
      We have identified a problem with the "oversubscription" policy in the
      link transmission code.
      
      When small messages are transmitted, and the sending link has reached
      the transmit window limit, those messages will be bundled and put into
      the link backlog queue. However, bundles of data messages are counted
      at the 'CRITICAL' level, so that the counter for that level, instead of
      the counter for the real, bundled message's level is the one being
      increased.
      Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
      to be tested against the unchanged counter for their own level, while
      contributing to an unrestrained increase at the CRITICAL backlog level.
      
      This leaves a gap in the congestion control algorithm for small
      messages, which can result in starvation of other users or of a
      "real" CRITICAL user. It can eventually even lead to buffer
      exhaustion and link reset.
      
      We fix this by keeping a 'target_bskb' buffer pointer for each level;
      when bundling, we then only bundle messages of the same importance
      level. This way, we know exactly how many slots a certain level
      occupies in the queue, so we can manage per-level congestion
      accurately.
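      A minimal sketch of the per-level bookkeeping this implies; the level
      names and struct layout are illustrative (the kernel keeps
      'target_bskb' inside the link's backlog array):

        struct sk_buff;   /* opaque here; only pointers are stored */

        enum { LVL_LOW = 0, LVL_MEDIUM, LVL_HIGH, LVL_CRITICAL, LVL_CNT };

        struct backlog_level {
                unsigned int len;             /* buffers queued at this level     */
                unsigned int limit;           /* congestion limit for this level  */
                struct sk_buff *target_bskb;  /* current bundle candidate or NULL */
        };

        /* One entry per importance level: a message is only ever bundled
         * into backlog[its own level].target_bskb, so each level's 'len'
         * reflects exactly the slots that level occupies. */
        static struct backlog_level backlog[LVL_CNT];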
      
      By bundling messages at the same level we gain further benefits.
      Consider this:
      - One socket sends 64-byte messages at the 'CRITICAL' level;
      - Another sends 4096-byte messages at the 'LOW' level;

      When a 64-byte message arrives and is bundled for the first time, we
      pay the bundling overhead up front (+ 40-byte header, data copy,
      etc.) for later use, but the next message may be a 4096-byte one that
      cannot be bundled with the previous one. This means the bundle
      carries only one payload message, which is totally inefficient, for
      the receiver as well! Later, another 64-byte message arrives, a new
      bundle is created, and the same story repeats...
      
      With the new bundling algorithm this will not happen: the 64-byte
      messages will be bundled together even when the 4096-byte message(s)
      come in between. However, if the 4096-byte messages are sent at the
      same level, i.e. 'CRITICAL', the bundling algorithm will again cause
      the same overhead.

      The same also happens even with only one socket sending small
      messages at a rate close to the link's transmit rate, so that as soon
      as one message is bundled it is transmitted. Then another message
      comes, a new bundle is created, and so on...

      We will solve this issue more radically in another patch.
      
      Fixes: 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e95584a8
  17. 26 Jul 2019, 1 commit
    • tipc: fix changeover issues due to large packet · 2320bcda
      Committed by Tuong Lien
      When the interfaces' MTU is changed (e.g. especially in the case of
      bonding) while the TIPC links are brought up and down within a short
      time, a couple of issues were detected with the current link
      changeover mechanism:
      
      1) When one link is up but immediately forced down again, the
      failover procedure will be carried out in order to fail over all the
      messages in the link's transmq queue onto the other working link. The
      link and node state is also set to FAILINGOVER as part of the
      process. The messages will be transmitted in the form of
      FAILOVER_MSGs, so their size grows by 40 bytes (the message header
      size). There is no problem if the original message size is not larger
      than the link's MTU - 40, and indeed this is the max size of normal
      payload messages. However, in the situation above, because the link
      has only just come up, the messages in the link's transmq are mostly
      SYNCH_MSGs generated by the link synching procedure, so their size
      may already be at the maximum! When a FAILOVER_MSG is built on top of
      such a SYNCH_MSG, its size will exceed the link's MTU. As a result,
      the messages are silently dropped and the failover procedure never
      completes; the link cannot exit the FAILINGOVER state and so cannot
      be re-established.
      
      2) The same scenario can happen more easily when the MTUs of the
      links are set differently or are being changed. In that case, as soon
      as a large message in the failing link's transmq queue was built and
      fragmented with that link's MTU larger than the other link's, the
      issue occurs (no prior link synching is needed).
      
      3) The link synching procedure faces the same issue, but since
      synching is only started upon receipt of a SYNCH_MSG, dropping the
      message does not result in a state deadlock; it is still not what the
      design expects.
      
      Issues 1) and 3) are resolved by the previous commit, which generates
      only a dummy SYNCH_MSG (i.e. without data) at link synching, so the
      size of any FAILOVER_MSG will never exceed the link's MTU.

      For issue 2), the only solution is to fragment the messages in the
      failing link's transmq queue according to the working link's MTU so
      they can then be failed over. A new function is made to accomplish
      this; the result is still a TUNNEL PROTOCOL/FAILOVER MSG, but if the
      original message size is too large it will be fragmented and
      reassembled at the receiving side.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2320bcda
  18. 30 Sep 2018, 2 commits
    • tipc: buffer overflow handling in listener socket · 67879274
      Committed by Tung Nguyen
      The default socket receive buffer size for a listener socket is 2MB.
      For each arriving empty SYN, the Linux kernel allocates a 768-byte
      buffer. This means that a listener socket can serve at most about
      2700 simultaneous empty connection setup requests before it hits a
      receive buffer overflow, and far fewer if the SYNs carry any
      significant amount of data.
      
      When this happens the setup request is rejected, and the client
      receives an ECONNREFUSED error.
      
      This commit mitigates the problem by letting the client socket retry
      the SYN message multiple times when it sees it rejected with the code
      TIPC_ERR_OVERLOAD. Retransmission is done at random intervals in the
      range [100 ms, setup_timeout / 4], as many times as there is room for
      within the setup timeout limit.
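      A small user-space sketch of how such a randomized retry interval
      could be computed (the kernel would use its own PRNG helpers and
      schedule the retry on the socket's connection timer):

        #include <stdlib.h>

        /* Pick a retry delay in [100 ms, setup_timeout / 4]. */
        static unsigned int syn_retry_delay_ms(unsigned int setup_timeout_ms)
        {
                unsigned int lo = 100;
                unsigned int hi = setup_timeout_ms / 4;

                if (hi <= lo)
                        return lo;
                return lo + (unsigned int)(rand() % (hi - lo + 1));
        }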
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67879274
    • tipc: refactor function tipc_msg_reverse() · 5cbdbd1a
      Committed by Jon Maloy
      The function tipc_msg_reverse() is reversing the header of a message
      while reusing the original buffer. We have seen at several occasions
      that this may have unfortunate side effects when the buffer to be
      reversed is a clone.
      
      In one of the following commits we will again need to reverse cloned
      buffers, so this is the right time to permanently eliminate this
      problem. In this commit we let the said function always consume the
      original buffer and replace it with a new one when applicable.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5cbdbd1a
  19. 30 Jun 2018, 1 commit
    • tipc: eliminate buffer cloning in function tipc_msg_extract() · ef9be755
      Committed by Tung Nguyen
      The function tipc_msg_extract() is using skb_clone() to clone inner
      messages from a message bundle buffer. Although this method is safe,
      it has the undesired effect that each buffer clone inherits the
      truesize of the bundling buffer. As a result, the clone almost always
      ends up being copied anyway by the message validation function, which
      makes the cloning a sub-optimization.
      
      In this commit we take the consequence of this realization, and copy
      each inner message to a separately allocated buffer up front in the
      extraction function.
      
      As a bonus we can now eliminate the two cases where we had to copy
      re-routed packets that may potentially go out on the wire again.
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef9be755
  20. 18 Mar 2018, 1 commit
    • tipc: obsolete TIPC_ZONE_SCOPE · 928df188
      Committed by Jon Maloy
      Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
      aspects handled the same way, both on the publishing node and on the
      receiving nodes.
      
      Despite previous ambitions to the contrary, this is never going to
      change, so we take the consequence of this and obsolete
      TIPC_ZONE_SCOPE and related macros/functions. Whenever a user does a
      bind() or a sendmsg() attempt using ZONE_SCOPE, we translate this
      internally to CLUSTER_SCOPE, while remaining compatible with users
      and remote nodes still using ZONE_SCOPE.
      
      Furthermore, the non-formalized scope value 0 has always been permitted
      for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
      We now permit it even as binding scope, but for compatibility reasons we
      choose to not change the value of TIPC_CLUSTER_SCOPE.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      928df188
  21. 09 Feb 2018, 1 commit
    • tipc: fix skb truesize/datasize ratio control · 55b3280d
      Committed by Hoang Le
      In commit d618d09a ("tipc: enforce valid ratio between skb truesize
      and contents") we introduced a test for ensuring that the condition
      truesize/datasize <= 4 is true for a received buffer. Unfortunately this
      test has two problems.
      
      - Because of the integer arithmetic, the test
        if (skb->truesize / buf_roundup_len(skb) > 4) will miss all
        ratios in the range 4 < ratio < 5, which was not the intention.
      - The buffer returned by skb_copy() inherits skb->truesize of the
        original buffer, which doesn't help the situation at all.
      
      In this commit, we change the ratio condition and replace skb_copy()
      with a call to skb_copy_expand() to finally get this right.
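      A tiny user-space illustration of the integer-division pitfall (the
      numbers are made up, and the multiplication-based comparison is just
      one way to avoid the truncation, not necessarily the exact change
      made in this commit):

        #include <stdio.h>

        int main(void)
        {
                unsigned int truesize = 4608;  /* hypothetical skb truesize */
                unsigned int datasize = 1024;  /* rounded-up data length    */

                /* real ratio is 4.5, but integer division yields 4 */
                printf("ratio (float) = %.2f\n", (double)truesize / datasize);
                printf("ratio (int)   = %u\n", truesize / datasize);
                printf("'/ > 4' check : %s\n",
                       truesize / datasize > 4 ? "copy" : "wrongly passes");
                printf("'> 4*datasize': %s\n",
                       truesize > 4 * datasize ? "copy" : "passes");
                return 0;
        }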
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55b3280d
  22. 02 Dec 2017, 1 commit
    • tipc: fall back to smaller MTU if allocation of local send skb fails · 4c94cc2d
      Committed by Jon Maloy
      When sending node-local messages the code is using an 'mtu' of 66060
      bytes to avoid unnecessary fragmentation. During situations of low
      memory, tipc_msg_build() may sometimes fail to allocate such large
      buffers, resulting in unnecessary send failures. This can easily be
      remedied by falling back to a smaller MTU, and then reassembling the
      buffer chain as if the message had arrived from a remote node.
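      A hedged sketch of the fallback idea (the surrounding variables and
      the FB_MTU name are assumptions borrowed from the commit text; the
      real change lives inside the TIPC send path):

        rc = tipc_msg_build(mhdr, m, 0, dlen, mtu, &pkts);   /* mtu = 66060 */
        if (unlikely(rc == -ENOMEM)) {
                __skb_queue_purge(&pkts);
                /* retry with the small fallback MTU; the receive side will
                 * reassemble the chain as if it came from a remote node */
                rc = tipc_msg_build(mhdr, m, 0, dlen, FB_MTU, &pkts);
        }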
      
      At the same time, we change the initial MTU setting of the broadcast
      link to a lower value, so that large messages always are fragmented
      into smaller buffers even when we run in single node mode. Apart from
      obtaining the same advantage as for the 'fallback' solution above, this
      turns out to give a significant performance improvement. This can
      probably be explained with the __pskb_copy() operation performed on the
      buffer for each recipient during reception. We found the optimal value
      for this, considering the most relevant skb pool, to be 3744 bytes.
      Acked-by: Ying Xue <ying.xue@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c94cc2d
  23. 16 Nov 2017, 1 commit
    • tipc: enforce valid ratio between skb truesize and contents · d618d09a
      Committed by Jon Maloy
      The socket level flow control is based on the assumption that incoming
      buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
      where the latter value is rounded off upwards to the nearest 1k number.
      This does empirically hold true for the device drivers we know, but we
      cannot trust that it will always be so, e.g., in a system with jumbo
      frames and very small packets.
      
      We now introduce a check for this condition at packet arrival, and if
      we find it to be false, we copy the packet to a new, smaller buffer,
      where the condition will be true. We expect this to affect only a small
      fraction of all incoming packets, if at all.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d618d09a
  24. 13 Oct 2017, 1 commit
  25. 09 Oct 2017, 1 commit
    • tipc: Unclone message at secondary destination lookup · a9e2971b
      Committed by Jon Maloy
      When a bundling message is received, the function tipc_link_input()
      calls function tipc_msg_extract() to unbundle all inner messages of
      the bundling message before adding them to input queue.
      
      The function tipc_msg_extract() just clones the inner skb for each
      inner message of the bundling skb. This means that the skb headroom
      of an inner message overlaps with the data part of the preceding
      message in the bundle.
      
      If the message in question is a name addressed message, it may be
      subject to a secondary destination lookup, and eventually be sent out
      on one of the interfaces again. But, since what is perceived as headroom
      by the device driver in reality is the last bytes of the preceding
      message in the bundle, the latter will be overwritten by the MAC
      addresses of the L2 header. If the preceding message has not yet been
      consumed by the user, it will evenually be delivered with corrupted
      contents.
      
      This commit fixes this by uncloning all messages passing through the
      function tipc_msg_lookup_dest(), hence ensuring that the headroom
      is always valid when the message is passed on.
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9e2971b
  26. 01 Oct 2017, 1 commit
  27. 25 Aug 2017, 1 commit
  28. 15 Aug 2017, 1 commit
    • tipc: avoid inheriting msg_non_seq flag when message is returned · 59a361bc
      Committed by Jon Paul Maloy
      In the function msg_reverse(), we reverse the header while trying to
      reuse the original buffer whenever possible. Those rejected/returned
      messages are always transmitted as unicast, but the msg_non_seq field
      is not explicitly set to zero as it should be.
      
      We have seen cases where multicast senders set the message type to
      "NOT dest_droppable", meaning that a multicast message shorter than
      one MTU will be returned, e.g., during receive buffer overflow, by
      reusing the original buffer. This has the effect that even the
      'msg_non_seq' field is inadvertently inherited by the rejected message,
      although it is now sent as a unicast message. This again leads the
      receiving unicast link endpoint to steer the packet toward the broadcast
      link receive function, where it is dropped. The affected unicast link is
      thereafter (after 100 failed retransmissions) declared 'stale' and
      reset.
      
      We fix this by unconditionally setting the 'msg_non_seq' flag to zero
      for all rejected/returned messages.
      Reported-by: Canh Duc Luu <canh.d.luu@dektech.com.au>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59a361bc
  29. 11 Jun 2017, 1 commit
  30. 21 Jan 2017, 1 commit
  31. 17 Jan 2017, 1 commit
  32. 06 Dec 2016, 1 commit
    • [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Committed by Al Viro
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et al., advancing the iterator only in case of a successful full copy
      and returning whether it was successful or not.
      
      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for those unless you are sure that
      not advancing iov_iter in failure case is the right thing in
      this case.  Anything that does short read/short write kind of
      stuff (or is in a loop, etc.) is unlikely to be a good one.
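      A hedged usage sketch of the '_full' variant in a receive-style
      helper (the surrounding function is hypothetical; copy_from_iter_full()
      itself is the primitive introduced here):

        /* With the '_full' variant the iterator is only advanced when the
         * whole copy succeeds, so the caller can simply bail out on failure
         * without worrying about a partially advanced iov_iter. */
        static int fill_payload(void *buf, size_t len, struct iov_iter *from)
        {
                if (!copy_from_iter_full(buf, len, from))
                        return -EFAULT;
                return 0;
        }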
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cbbd26b8
  33. 23 Jun 2016, 1 commit
    • tipc: unclone unbundled buffers before forwarding · 27777daa
      Committed by Jon Paul Maloy
      When extracting an individual message from a received "bundle" buffer,
      we just create a clone of the base buffer, and adjust it to point into
      the right position of the linearized data area of the latter. This works
      well for regular message reception, but during periods of extremely high
      load it may happen that an extracted buffer, e.g., a connection probe, is
      reversed and forwarded through an external interface while the preceding
      extracted message is still unhandled. When this happens, the header or
      data area of the preceding message will be partially overwritten by a
      MAC header, leading to unpredictable consequences, such as a link
      reset.
      
      We now fix this by ensuring that the msg_reverse() function never
      returns a cloned buffer, and that the returned buffer always contains
      sufficient valid head and tail room to be forwarded.
      Reported-by: Erik Hugne <erik.hugne@gmail.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27777daa
  34. 24 Oct 2015, 2 commits
  35. 22 Oct 2015, 1 commit
    • tipc: allow non-linear first fragment buffer · 45c8b7b1
      Committed by Jon Paul Maloy
      The current code for message reassembly erroneously assumes that the
      first arriving fragment buffer is always linear, and then goes ahead
      and resets the fragment list of that buffer in anticipation of more
      arriving fragments.

      However, if the buffer already happens to be non-linear, we will
      inadvertently drop the already attached fragment list, and later on
      trigger a BUG() in __pskb_pull_tail().
      
      We see this happen when running fragmented TIPC multicast across UDP,
      something made possible since
      commit d0f91938 ("tipc: add ip/udp media type")
      
      We fix this by not resetting the fragment list when the buffer is non-
      linear, and by initializing our private fragment list tail pointer to
      the tail of the existing fragment list.
      
      Fixes: commit d0f91938 ("tipc: add ip/udp media type")
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45c8b7b1
  36. 16 Oct 2015, 1 commit
    • tipc: disallow packet duplicates in link deferred queue · 8306f99a
      Committed by Jon Paul Maloy
      After the previous commits, we are guaranteed that no packets of type
      LINK_PROTOCOL or with illegal sequence numbers will be added to the
      link deferred queue. This makes it possible to simplify the sorting
      algorithm in the function tipc_skb_queue_sorted().
      
      We also alter the function so that it drops a packet if one with the
      same sequence number is already present in the queue. This is
      necessary because we have identified weird packet sequences,
      involving duplicate packets, where a legitimate in-sequence packet
      may advance to the head of the queue without being detected and
      de-queued.
      
      Finally, we make this function out-of-line, since it will now be
      called only in exceptional cases.
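      A sketch of a seqno-sorted insert that drops duplicates, in the
      spirit of the function described above (buf_seqno(), less() and
      more() are existing TIPC helpers, but the exact upstream body may
      differ):

        static void skb_queue_sorted(struct sk_buff_head *list, u16 seqno,
                                     struct sk_buff *skb)
        {
                struct sk_buff *_skb, *tmp;

                /* Fast paths: empty queue, or strictly newest packet. */
                if (skb_queue_empty(list) ||
                    less(buf_seqno(skb_peek_tail(list)), seqno)) {
                        __skb_queue_tail(list, skb);
                        return;
                }
                /* Otherwise insert before the first larger seqno, and drop
                 * the packet if the same seqno is already queued. */
                skb_queue_walk_safe(list, _skb, tmp) {
                        if (more(seqno, buf_seqno(_skb)))
                                continue;
                        if (seqno == buf_seqno(_skb))
                                break;            /* duplicate: drop below */
                        __skb_queue_before(list, _skb, skb);
                        return;
                }
                kfree_skb(skb);
        }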
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8306f99a
  37. 21 Sep 2015, 1 commit