1. 24 11月, 2019 1 次提交
  2. 31 10月, 2019 1 次提交
    • J
      tipc: add smart nagle feature · c0bceb97
      Jon Maloy 提交于
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue is transmitted when one of
      the following events or conditions occur.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0bceb97
  3. 30 10月, 2019 1 次提交
    • H
      tipc: improve throughput between nodes in netns · f73b1281
      Hoang Le 提交于
      Currently, TIPC transports intra-node user data messages directly
      socket to socket, hence shortcutting all the lower layers of the
      communication stack. This gives TIPC very good intra node performance,
      both regarding throughput and latency.
      
      We now introduce a similar mechanism for TIPC data traffic across
      network namespaces located in the same kernel. On the send path, the
      call chain is as always accompanied by the sending node's network name
      space pointer. However, once we have reliably established that the
      receiving node is represented by a namespace on the same host, we just
      replace the namespace pointer with the receiving node/namespace's
      ditto, and follow the regular socket receive patch though the receiving
      node. This technique gives us a throughput similar to the node internal
      throughput, several times larger than if we let the traffic go though
      the full network stacks. As a comparison, max throughput for 64k
      messages is four times larger than TCP throughput for the same type of
      traffic.
      
      To meet any security concerns, the following should be noted.
      
      - All nodes joining a cluster are supposed to have been be certified
      and authenticated by mechanisms outside TIPC. This is no different for
      nodes/namespaces on the same host; they have to auto discover each
      other using the attached interfaces, and establish links which are
      supervised via the regular link monitoring mechanism. Hence, a kernel
      local node has no other way to join a cluster than any other node, and
      have to obey to policies set in the IP or device layers of the stack.
      
      - Only when a sender has established with 100% certainty that the peer
      node is located in a kernel local namespace does it choose to let user
      data messages, and only those, take the crossover path to the receiving
      node/namespace.
      
      - If the receiving node/namespace is removed, its namespace pointer
      is invalidated at all peer nodes, and their neighbor link monitoring
      will eventually note that this node is gone.
      
      - To ensure the "100% certainty" criteria, and prevent any possible
      spoofing, received discovery messages must contain a proof that the
      sender knows a common secret. We use the hash mix of the sending
      node/namespace for this purpose, since it can be accessed directly by
      all other namespaces in the kernel. Upon reception of a discovery
      message, the receiver checks this proof against all the local
      namespaces'hash_mix:es. If it finds a match, that, along with a
      matching node id and cluster id, this is deemed sufficient proof that
      the peer node in question is in a local namespace, and a wormhole can
      be opened.
      
      - We should also consider that TIPC is intended to be a cluster local
      IPC mechanism (just like e.g. UNIX sockets) rather than a network
      protocol, and hence we think it can justified to allow it to shortcut the
      lower protocol layers.
      
      Regarding traceability, we should notice that since commit 6c9081a3
      ("tipc: add loopback device tracking") it is possible to follow the node
      internal packet flow by just activating tcpdump on the loopback
      interface. This will be true even for this mechanism; by activating
      tcpdump on the involved nodes' loopback interfaces their inter-name
      space messaging can easily be tracked.
      
      v2:
      - update 'net' pointer when node left/rejoined
      v3:
      - grab read/write lock when using node ref obj
      v4:
      - clone traffics between netns to loopback
      Suggested-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f73b1281
  4. 29 10月, 2019 1 次提交
  5. 10 10月, 2019 2 次提交
    • E
      net: silence KCSAN warnings about sk->sk_backlog.len reads · 70c26558
      Eric Dumazet 提交于
      sk->sk_backlog.len can be written by BH handlers, and read
      from process contexts in a lockless way.
      
      Note the write side should also use WRITE_ONCE() or a variant.
      We need some agreement about the best way to do this.
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_grow_window.isra.0
      
      write to 0xffff88812665f32c of 4 bytes by interrupt on cpu 1:
       sk_add_backlog include/net/sock.h:934 [inline]
       tcp_add_backlog+0x4a0/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812665f32c of 4 bytes by task 7292 on cpu 0:
       tcp_space include/net/tcp.h:1373 [inline]
       tcp_grow_window.isra.0+0x6b/0x480 net/ipv4/tcp_input.c:413
       tcp_event_data_recv+0x68f/0x990 net/ipv4/tcp_input.c:717
       tcp_rcv_established+0xbfe/0xf50 net/ipv4/tcp_input.c:5618
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       sk_backlog_rcv include/net/sock.h:945 [inline]
       __release_sock+0x135/0x1e0 net/core/sock.c:2427
       release_sock+0x61/0x160 net/core/sock.c:2943
       tcp_recvmsg+0x63b/0x1a30 net/ipv4/tcp.c:2181
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7292 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      70c26558
    • E
      net: silence KCSAN warnings around sk_add_backlog() calls · 8265792b
      Eric Dumazet 提交于
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      8265792b
  6. 06 10月, 2019 1 次提交
  7. 19 8月, 2019 1 次提交
    • J
      tipc: clean up skb list lock handling on send path · e654f9f5
      Jon Maloy 提交于
      The policy for handling the skb list locks on the send and receive paths
      is simple.
      
      - On the send path we never need to grab the lock on the 'xmitq' list
        when the destination is an exernal node.
      
      - On the receive path we always need to grab the lock on the 'inputq'
        list, irrespective of source node.
      
      However, when transmitting node local messages those will eventually
      end up on the receive path of a local socket, meaning that the argument
      'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in  the
      function tipc_sk_rcv(). This has been handled by always initializing
      the spinlock of the 'xmitq' list at message creation, just in case it
      may end up on the receive path later, and despite knowing that the lock
      in most cases never will be used.
      
      This approach is inaccurate and confusing, and has also concealed the
      fact that the stated 'no lock grabbing' policy for the send path is
      violated in some cases.
      
      We now clean up this by never initializing the lock at message creation,
      instead doing this at the moment we find that the message actually will
      enter the receive path. At the same time we fix the four locations
      where we incorrectly access the spinlock on the send/error path.
      
      This patch also reverts commit d12cffe9 ("tipc: ensure head->lock
      is initialised") which has now become redundant.
      
      CC: Eric Dumazet <edumazet@google.com>
      Reported-by: NChris Packham <chris.packham@alliedtelesis.co.nz>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e654f9f5
  8. 31 7月, 2019 1 次提交
    • J
      tipc: fix unitilized skb list crash · 2948a1fc
      Jon Maloy 提交于
      Our test suite somtimes provokes the following crash:
      
      Description of problem:
      [ 1092.597234] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8
      [ 1092.605072] PGD 0 P4D 0
      [ 1092.607620] Oops: 0000 [#1] SMP PTI
      [ 1092.611118] CPU: 37 PID: 0 Comm: swapper/37 Kdump: loaded Not tainted 4.18.0-122.el8.x86_64 #1
      [ 1092.619724] Hardware name: Dell Inc. PowerEdge R740/08D89F, BIOS 1.3.7 02/08/2018
      [ 1092.627215] RIP: 0010:tipc_mcast_filter_msg+0x93/0x2d0 [tipc]
      [ 1092.632955] Code: 0f 84 aa 01 00 00 89 cf 4d 01 ca 4c 8b 26 c1 ef 19 83 e7 0f 83 ff 0c 4d 0f 45 d1 41 8b 6a 10 0f cd 4c 39 e6 0f 84 81 01 00 00 <4d> 8b 9c 24 e8 00 00 00 45 8b 13 41 0f ca 44 89 d7 c1 ef 13 83 e7
      [ 1092.651703] RSP: 0018:ffff929e5fa83a18 EFLAGS: 00010282
      [ 1092.656927] RAX: ffff929e3fb38100 RBX: 00000000069f29ee RCX: 00000000416c0045
      [ 1092.664058] RDX: ffff929e5fa83a88 RSI: ffff929e31a28420 RDI: 0000000000000000
      [ 1092.671209] RBP: 0000000029b11821 R08: 0000000000000000 R09: ffff929e39b4407a
      [ 1092.678343] R10: ffff929e39b4407a R11: 0000000000000007 R12: 0000000000000000
      [ 1092.685475] R13: 0000000000000001 R14: ffff929e3fb38100 R15: ffff929e39b4407a
      [ 1092.692614] FS:  0000000000000000(0000) GS:ffff929e5fa80000(0000) knlGS:0000000000000000
      [ 1092.700702] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1092.706447] CR2: 00000000000000e8 CR3: 000000031300a004 CR4: 00000000007606e0
      [ 1092.713579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1092.720712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1092.727843] PKRU: 55555554
      [ 1092.730556] Call Trace:
      [ 1092.733010]  <IRQ>
      [ 1092.735034]  tipc_sk_filter_rcv+0x7ca/0xb80 [tipc]
      [ 1092.739828]  ? __kmalloc_node_track_caller+0x1cb/0x290
      [ 1092.744974]  ? dev_hard_start_xmit+0xa5/0x210
      [ 1092.749332]  tipc_sk_rcv+0x389/0x640 [tipc]
      [ 1092.753519]  tipc_sk_mcast_rcv+0x23c/0x3a0 [tipc]
      [ 1092.758224]  tipc_rcv+0x57a/0xf20 [tipc]
      [ 1092.762154]  ? ktime_get_real_ts64+0x40/0xe0
      [ 1092.766432]  ? tpacket_rcv+0x50/0x9f0
      [ 1092.770098]  tipc_l2_rcv_msg+0x4a/0x70 [tipc]
      [ 1092.774452]  __netif_receive_skb_core+0xb62/0xbd0
      [ 1092.779164]  ? enqueue_entity+0xf6/0x630
      [ 1092.783084]  ? kmem_cache_alloc+0x158/0x1c0
      [ 1092.787272]  ? __build_skb+0x25/0xd0
      [ 1092.790849]  netif_receive_skb_internal+0x42/0xf0
      [ 1092.795557]  napi_gro_receive+0xba/0xe0
      [ 1092.799417]  mlx5e_handle_rx_cqe+0x83/0xd0 [mlx5_core]
      [ 1092.804564]  mlx5e_poll_rx_cq+0xd5/0x920 [mlx5_core]
      [ 1092.809536]  mlx5e_napi_poll+0xb2/0xce0 [mlx5_core]
      [ 1092.814415]  ? __wake_up_common_lock+0x89/0xc0
      [ 1092.818861]  net_rx_action+0x149/0x3b0
      [ 1092.822616]  __do_softirq+0xe3/0x30a
      [ 1092.826193]  irq_exit+0x100/0x110
      [ 1092.829512]  do_IRQ+0x85/0xd0
      [ 1092.832483]  common_interrupt+0xf/0xf
      [ 1092.836147]  </IRQ>
      [ 1092.838255] RIP: 0010:cpuidle_enter_state+0xb7/0x2a0
      [ 1092.843221] Code: e8 3e 79 a5 ff 80 7c 24 03 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d7 01 00 00 31 ff e8 a0 6b ab ff fb 66 0f 1f 44 00 00 <48> b8 ff ff ff ff f3 01 00 00 4c 29 f3 ba ff ff ff 7f 48 39 c3 7f
      [ 1092.861967] RSP: 0018:ffffaa5ec6533e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      [ 1092.869530] RAX: ffff929e5faa3100 RBX: 000000fe63dd2092 RCX: 000000000000001f
      [ 1092.876665] RDX: 000000fe63dd2092 RSI: 000000003a518aaa RDI: 0000000000000000
      [ 1092.883795] RBP: 0000000000000003 R08: 0000000000000004 R09: 0000000000022940
      [ 1092.890929] R10: 0000040cb0666b56 R11: ffff929e5faa20a8 R12: ffff929e5faade78
      [ 1092.898060] R13: ffffffffb59258f8 R14: 000000fe60f3228d R15: 0000000000000000
      [ 1092.905196]  ? cpuidle_enter_state+0x92/0x2a0
      [ 1092.909555]  do_idle+0x236/0x280
      [ 1092.912785]  cpu_startup_entry+0x6f/0x80
      [ 1092.916715]  start_secondary+0x1a7/0x200
      [ 1092.920642]  secondary_startup_64+0xb7/0xc0
      [...]
      
      The reason is that the skb list tipc_socket::mc_method.deferredq only
      is initialized for connectionless sockets, while nothing stops arriving
      multicast messages from being filtered by connection oriented sockets,
      with subsequent access to the said list.
      
      We fix this by initializing the list unconditionally at socket creation.
      This eliminates the crash, while the message still is dropped further
      down in tipc_sk_filter_rcv() as it should be.
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2948a1fc
  9. 10 5月, 2019 1 次提交
  10. 28 4月, 2019 2 次提交
    • J
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg 提交于
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8cb08174
    • M
      netlink: make nla_nest_start() add NLA_F_NESTED flag · ae0be8de
      Michal Kubecek 提交于
      Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
      netlink based interfaces (including recently added ones) are still not
      setting it in kernel generated messages. Without the flag, message parsers
      not aware of attribute semantics (e.g. wireshark dissector or libmnl's
      mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
      the structure of their contents.
      
      Unfortunately we cannot just add the flag everywhere as there may be
      userspace applications which check nlattr::nla_type directly rather than
      through a helper masking out the flags. Therefore the patch renames
      nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
      as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
      are rewritten to use nla_nest_start().
      
      Except for changes in include/net/netlink.h, the patch was generated using
      this semantic patch:
      
      @@ expression E1, E2; @@
      -nla_nest_start(E1, E2)
      +nla_nest_start_noflag(E1, E2)
      
      @@ expression E1, E2; @@
      -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
      +nla_nest_start(E1, E2)
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae0be8de
  11. 20 4月, 2019 1 次提交
  12. 22 3月, 2019 2 次提交
    • H
      tipc: fix a null pointer deref · 08e046c8
      Hoang Le 提交于
      In commit c55c8eda ("tipc: smooth change between replicast and
      broadcast") we introduced new method to eliminate the risk of message
      reordering that happen in between different nodes.
      Unfortunately, we forgot checking at receiving side to ignore intra node.
      
      We fix this by checking and returning if arrived message from intra node.
      
      syzbot report:
      
      ==================================================================
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 01/01/2011
      RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
      Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
       24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
       <80> 3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
      RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
      RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
      RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
      RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
      R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
      R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
      FS:  000000000106a880(0000) GS:ffff8880ae800000(0000)
       knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
       tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
       tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
       __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
       tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       ___sys_sendmsg+0x806/0x930 net/socket.c:2260
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
       __do_sys_sendmsg net/socket.c:2307 [inline]
       __se_sys_sendmsg net/socket.c:2305 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401c9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
       48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
       <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
      RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
      R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
      Modules linked in:
      ---[ end trace ba79875754e1708f ]---
      
      Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08e046c8
    • H
      tipc: fix use-after-free in tipc_sk_filter_rcv · 77d5ad40
      Hoang Le 提交于
      skb free-ed in:
        1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv
        2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg
      This leads to a "use-after-free" access in the next condition.
      
      We fix this by intializing the variable at declaration, then it is safe
      to check this variable to continue processing if condition matches.
      
      syzbot report:
      
      ==================================================================
      BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0
       net/tipc/socket.c:2167
      Read of size 4 at addr ffff88808ea58534 by task kworker/u4:0/7
      
      CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
       BIOS Google 01/01/2011
      Workqueue: tipc_send tipc_conn_send_work
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
       tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610
       tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283
       tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303
       process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x357/0x430 kernel/kthread.c:253
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      
      Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77d5ad40
  13. 20 3月, 2019 1 次提交
    • H
      tipc: smooth change between replicast and broadcast · c55c8eda
      Hoang Le 提交于
      Currently, a multicast stream may start out using replicast, because
      there are few destinations, and then it should ideally switch to
      L2/broadcast IGMP/multicast when the number of destinations grows beyond
      a certain limit. The opposite should happen when the number decreases
      below the limit.
      
      To eliminate the risk of message reordering caused by method change,
      a sending socket must stick to a previously selected method until it
      enters an idle period of 5 seconds. Means there is a 5 seconds pause
      in the traffic from the sender socket.
      
      If the sender never makes such a pause, the method will never change,
      and transmission may become very inefficient as the cluster grows.
      
      With this commit, we allow such a switch between replicast and
      broadcast without any need for a traffic pause.
      
      Solution is to send a dummy message with only the header, also with
      the SYN bit set, via broadcast or replicast. For the data message,
      the SYN bit is set and sending via replicast or broadcast (inverse
      method with dummy).
      
      Then, at receiving side any messages follow first SYN bit message
      (data or dummy message), they will be held in deferred queue until
      another pair (dummy or data message) arrived in other link.
      
      v2: reverse christmas tree declaration
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c55c8eda
  14. 18 3月, 2019 1 次提交
  15. 17 3月, 2019 1 次提交
  16. 06 3月, 2019 1 次提交
  17. 27 2月, 2019 1 次提交
    • T
      tipc: fix race condition causing hung sendto · bfd07f3d
      Tung Nguyen 提交于
      When sending multicast messages via blocking socket,
      if sending link is congested (tsk->cong_link_cnt is set to 1),
      the sending thread will be put into sleeping state. However,
      tipc_sk_filter_rcv() is called under socket spin lock but
      tipc_wait_for_cond() is not. So, there is no guarantee that
      the setting of tsk->cong_link_cnt to 0 in tipc_sk_proto_rcv() in
      CPU-1 will be perceived by CPU-0. If that is the case, the sending
      thread in CPU-0 after being waken up, will continue to see
      tsk->cong_link_cnt as 1 and put the sending thread into sleeping
      state again. The sending thread will sleep forever.
      
      CPU-0                                | CPU-1
      tipc_wait_for_cond()                 |
      {                                    |
       // condition_ = !tsk->cong_link_cnt |
       while ((rc_ = !(condition_))) {     |
        ...                                |
        release_sock(sk_);                 |
        wait_woken();                      |
                                           | if (!sock_owned_by_user(sk))
                                           |  tipc_sk_filter_rcv()
                                           |  {
                                           |   ...
                                           |   tipc_sk_proto_rcv()
                                           |   {
                                           |    ...
                                           |    tsk->cong_link_cnt--;
                                           |    ...
                                           |    sk->sk_write_space(sk);
                                           |    ...
                                           |   }
                                           |   ...
                                           |  }
        sched_annotate_sleep();            |
        lock_sock(sk_);                    |
        remove_wait_queue();               |
       }                                   |
      }                                    |
      
      This commit fixes it by adding memory barrier to tipc_sk_proto_rcv()
      and tipc_wait_for_cond().
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfd07f3d
  18. 22 2月, 2019 2 次提交
  19. 24 1月, 2019 1 次提交
    • G
      tipc: mark expected switch fall-throughs · f79e3365
      Gustavo A. R. Silva 提交于
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      net/tipc/link.c:1125:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      net/tipc/socket.c:736:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      net/tipc/socket.c:2418:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enabling
      -Wimplicit-fallthrough.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f79e3365
  20. 20 12月, 2018 2 次提交
    • T
      tipc: add trace_events for tipc socket · 01e661eb
      Tuong Lien 提交于
      The commit adds the new trace_events for TIPC socket object:
      
      trace_tipc_sk_create()
      trace_tipc_sk_poll()
      trace_tipc_sk_sendmsg()
      trace_tipc_sk_sendmcast()
      trace_tipc_sk_sendstream()
      trace_tipc_sk_filter_rcv()
      trace_tipc_sk_advance_rx()
      trace_tipc_sk_rej_msg()
      trace_tipc_sk_drop_msg()
      trace_tipc_sk_release()
      trace_tipc_sk_shutdown()
      trace_tipc_sk_overlimit1()
      trace_tipc_sk_overlimit2()
      
      Also, enables the traces for the following cases:
      - When user creates a TIPC socket;
      - When user calls poll() on TIPC socket;
      - When user sends a dgram/mcast/stream message.
      - When a message is put into the socket 'sk_receive_queue';
      - When a message is released from the socket 'sk_receive_queue';
      - When a message is rejected (e.g. due to no port, invalid, etc.);
      - When a message is dropped (e.g. due to wrong message type);
      - When socket is released;
      - When socket is shutdown;
      - When socket rcvq's allocation is overlimit (> 90%);
      - When socket rcvq + bklq's allocation is overlimit (> 90%);
      - When the 'TIPC_ERR_OVERLOAD/2' issue happens;
      
      Note:
      a) All the socket traces are designed to be able to trace on a specific
      socket by either using the 'event filtering' feature on a known socket
      'portid' value or the sysctl file:
      
      /proc/sys/net/tipc/sk_filter
      
      The file determines a 'tuple' for what socket should be traced:
      
      (portid, sock type, name type, name lower, name upper)
      
      where:
      + 'portid' is the socket portid generated at socket creating, can be
      found in the trace outputs or the 'tipc socket list' command printouts;
      + 'sock type' is the socket type (1 = SOCK_TREAM, ...);
      + 'name type', 'name lower' and 'name upper' are the service name being
      connected to or published by the socket.
      
      Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
      the traces happen for every sockets with no filter.
      
      b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
      which happens when the socket receive queue (and backlog queue) is
      about to be overloaded, when the queue allocation is > 90%. Then, when
      the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
      issue can be traced.
      
      The trace event is designed as an 'upper watermark' notification that
      the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
      actions can be triggerred in the meanwhile to see what is going on with
      the socket queue.
      
      In addition, the 'trace_tipc_sk_dump()' is also placed at the
      'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
      for post-analysis.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Tested-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01e661eb
    • T
      tipc: enable tracepoints in tipc · b4b9771b
      Tuong Lien 提交于
      As for the sake of debugging/tracing, the commit enables tracepoints in
      TIPC along with some general trace_events as shown below. It also
      defines some 'tipc_*_dump()' functions that allow to dump TIPC object
      data whenever needed, that is, for general debug purposes, ie. not just
      for the trace_events.
      
      The following trace_events are now available:
      
      - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
        e.g. message type, user, droppable, skb truesize, cloned skb, etc.
      
      - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
        queues, e.g. TIPC link transmq, socket receive queue, etc.
      
      - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
        sk state, sk type, connection type, rmem_alloc, socket queues, etc.
      
      - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
        link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
      
      - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
        node state, active links, capabilities, link entries, etc.
      
      How to use:
      Put the trace functions at any places where we want to dump TIPC data
      or events.
      
      Note:
      a) The dump functions will generate raw data only, that is, to offload
      the trace event's processing, it can require a tool or script to parse
      the data but this should be simple.
      
      b) The trace_tipc_*_dump() should be reserved for a failure cases only
      (e.g. the retransmission failure case) or where we do not expect to
      happen too often, then we can consider enabling these events by default
      since they will almost not take any effects under normal conditions,
      but once the rare condition or failure occurs, we get the dumped data
      fully for post-analysis.
      
      For other trace purposes, we can reuse these trace classes as template
      but different events.
      
      c) A trace_event is only effective when we enable it. To enable the
      TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
      directory in the 'debugfs' file system. Normally, they are located at:
      
      /sys/kernel/debug/tracing/events/tipc/
      
      For example:
      
      To enable the tipc_link_dump event:
      
      echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
      
      To enable all the TIPC trace_events:
      
      echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
      
      To collect the trace data:
      
      cat trace
      
      or
      
      cat trace_pipe > /trace.out &
      
      To disable all the TIPC trace_events:
      
      echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
      
      To clear the trace buffer:
      
      echo > trace
      
      d) Like the other trace_events, the feature like 'filter' or 'trigger'
      is also usable for the tipc trace_events.
      For more details, have a look at:
      
      Documentation/trace/ftrace.txt
      
      MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Tested-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4b9771b
  21. 19 12月, 2018 1 次提交
  22. 15 12月, 2018 2 次提交
    • C
      tipc: check tsk->group in tipc_wait_for_cond() · 143ece65
      Cong Wang 提交于
      tipc_wait_for_cond() drops socket lock before going to sleep,
      but tsk->group could be freed right after that release_sock().
      So we have to re-check and reload tsk->group after it wakes up.
      
      After this patch, tipc_wait_for_cond() returns -ERESTARTSYS when
      tsk->group is NULL, instead of continuing with the assumption of
      a non-NULL tsk->group.
      
      (It looks like 'dsts' should be re-checked and reloaded too, but
      it is a different bug.)
      
      Similar for tipc_send_group_unicast() and tipc_send_group_anycast().
      
      Reported-by: syzbot+10a9db47c3a0e13eb31c@syzkaller.appspotmail.com
      Fixes: b7d42635 ("tipc: introduce flow control for group broadcast messages")
      Fixes: ee106d7f ("tipc: introduce group anycast messaging")
      Fixes: 27bd9ec0 ("tipc: introduce group unicast messaging")
      Cc: Ying Xue <ying.xue@windriver.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      143ece65
    • C
      tipc: use lock_sock() in tipc_sk_reinit() · 15ef70e2
      Cong Wang 提交于
      lock_sock() must be used in process context to be race-free with
      other lock_sock() callers, for example, tipc_release(). Otherwise
      using the spinlock directly can't serialize a parallel tipc_release().
      
      As it is blocking, we have to hold the sock refcnt before
      rhashtable_walk_stop() and release it after rhashtable_walk_start().
      
      Fixes: 07f6c4bc ("tipc: convert tipc reference table to use generic rhashtable")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15ef70e2
  23. 18 11月, 2018 1 次提交
  24. 24 10月, 2018 1 次提交
    • K
      Revert "net: simplify sock_poll_wait" · 89ab066d
      Karsten Graul 提交于
      This reverts commit dd979b4d.
      
      This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
      internal TCP socket for the initial handshake with the remote peer.
      Whenever the SMC connection can not be established this TCP socket is
      used as a fallback. All socket operations on the SMC socket are then
      forwarded to the TCP socket. In case of poll, the file->private_data
      pointer references the SMC socket because the TCP socket has no file
      assigned. This causes tcp_poll to wait on the wrong socket.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89ab066d
  25. 11 10月, 2018 1 次提交
  26. 30 9月, 2018 4 次提交
  27. 26 9月, 2018 1 次提交
  28. 07 9月, 2018 1 次提交
    • C
      tipc: call start and done ops directly in __tipc_nl_compat_dumpit() · 8f5c5fcf
      Cong Wang 提交于
      __tipc_nl_compat_dumpit() uses a netlink_callback on stack,
      so the only way to align it with other ->dumpit() call path
      is calling tipc_dump_start() and tipc_dump_done() directly
      inside it. Otherwise ->dumpit() would always get NULL from
      cb->args[].
      
      But tipc_dump_start() uses sock_net(cb->skb->sk) to retrieve
      net pointer, the cb->skb here doesn't set skb->sk, the net pointer
      is saved in msg->net instead, so introduce a helper function
      __tipc_dump_start() to pass in msg->net.
      
      Ying pointed out cb->args[0...3] are already used by other
      callbacks on this call path, so we can't use cb->args[0] any
      more, use cb->args[4] instead.
      
      Fixes: 9a07efa9 ("tipc: switch to rhashtable iterator")
      Reported-and-tested-by: syzbot+e93a2c41f91b8e2c7d9b@syzkaller.appspotmail.com
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f5c5fcf
  29. 06 9月, 2018 1 次提交
  30. 30 8月, 2018 2 次提交
    • C
      tipc: switch to rhashtable iterator · 9a07efa9
      Cong Wang 提交于
      syzbot reported a use-after-free in tipc_group_fill_sock_diag(),
      where tipc_group_fill_sock_diag() still reads tsk->group meanwhile
      tipc_group_delete() just deletes it in tipc_release().
      
      tipc_nl_sk_walk() aims to lock this sock when walking each sock
      in the hash table to close race conditions with sock changes like
      this one, by acquiring tsk->sk.sk_lock.slock spinlock, unfortunately
      this doesn't work at all. All non-BH call path should take
      lock_sock() instead to make it work.
      
      tipc_nl_sk_walk() brutally iterates with raw rht_for_each_entry_rcu()
      where RCU read lock is required, this is the reason why lock_sock()
      can't be taken on this path. This could be resolved by switching to
      rhashtable iterator API's, where taking a sleepable lock is possible.
      Also, the iterator API's are friendly for restartable calls like
      diag dump, the last position is remembered behind the scence,
      all we need to do here is saving the iterator into cb->args[].
      
      I tested this with parallel tipc diag dump and thousands of tipc
      socket creation and release, no crash or memory leak.
      
      Reported-by: syzbot+b9c8f3ab2994b7cd1625@syzkaller.appspotmail.com
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a07efa9
    • C
      tipc: fix a missing rhashtable_walk_exit() · bd583fe3
      Cong Wang 提交于
      rhashtable_walk_exit() must be paired with rhashtable_walk_enter().
      
      Fixes: 40f9f439 ("tipc: Fix tipc_sk_reinit race conditions")
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd583fe3