1. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  2. 26 5月, 2018 1 次提交
  3. 11 5月, 2018 1 次提交
    • E
      tipc: fix one byte leak in tipc_sk_set_orig_addr() · 09c8b971
      Eric Dumazet 提交于
      sysbot/KMSAN reported an uninit-value in recvmsg() that
      I tracked down to tipc_sk_set_orig_addr(), missing
      srcaddr->member.scope initialization.
      
      This patches moves srcaddr->sock.scope init to follow
      fields order and ease future verifications.
      
      BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
      BUG: KMSAN: uninit-value in move_addr_to_user+0x32e/0x530 net/socket.c:226
      CPU: 0 PID: 4549 Comm: syz-executor287 Not tainted 4.17.0-rc3+ #88
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       kmsan_internal_check_memory+0x135/0x1e0 mm/kmsan/kmsan.c:1157
       kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
       copy_to_user include/linux/uaccess.h:184 [inline]
       move_addr_to_user+0x32e/0x530 net/socket.c:226
       ___sys_recvmsg+0x4e2/0x810 net/socket.c:2285
       __sys_recvmsg net/socket.c:2328 [inline]
       __do_sys_recvmsg net/socket.c:2338 [inline]
       __se_sys_recvmsg net/socket.c:2335 [inline]
       __x64_sys_recvmsg+0x325/0x460 net/socket.c:2335
       do_syscall_64+0x154/0x220 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x4455e9
      RSP: 002b:00007fe3bd36ddb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
      RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 00000000004455e9
      RDX: 0000000000002002 RSI: 0000000020000400 RDI: 0000000000000003
      RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff98ce4b6f R14: 00007fe3bd36e9c0 R15: 0000000000000003
      
      Local variable description: ----addr@___sys_recvmsg
      Variable was created at:
       ___sys_recvmsg+0xd5/0x810 net/socket.c:2246
       __sys_recvmsg net/socket.c:2328 [inline]
       __do_sys_recvmsg net/socket.c:2338 [inline]
       __se_sys_recvmsg net/socket.c:2335 [inline]
       __x64_sys_recvmsg+0x325/0x460 net/socket.c:2335
      
      Byte 19 of 32 is uninitialized
      
      Fixes: 31c82a2d ("tipc: add second source address to recvmsg()/recvfrom()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09c8b971
  4. 27 4月, 2018 1 次提交
  5. 13 4月, 2018 1 次提交
    • J
      tipc: fix missing initializer in tipc_sendmsg() · 335b929b
      Jon Maloy 提交于
      The stack variable 'dnode' in __tipc_sendmsg() may theoretically
      end up tipc_node_get_mtu() as an unitilalized variable.
      
      We fix this by intializing the variable at declaration. We also add
      a default else clause to the two conditional ones already there, so
      that we never end up in the named function if the given address
      type is illegal.
      
      Reported-by: syzbot+b0975ce9355b347c1546@syzkaller.appspotmail.com
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      335b929b
  6. 09 4月, 2018 1 次提交
    • C
      tipc: use the right skb in tipc_sk_fill_sock_diag() · e41f0548
      Cong Wang 提交于
      Commit 4b2e6877 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
      tried to fix the crash but failed, the crash is still 100% reproducible
      with it.
      
      In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
      correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
      we should use NETLINK_CB(cb->skb).sk here.
      
      Reported-by: <syzbot+326e587eff1074657718@syzkaller.appspotmail.com>
      Fixes: 4b2e6877 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
      Fixes: c30b70de (tipc: implement socket diagnostics for AF_TIPC)
      Cc: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e41f0548
  7. 04 4月, 2018 1 次提交
  8. 01 4月, 2018 1 次提交
    • J
      tipc: permit overlapping service ranges in name table · 37922ea4
      Jon Maloy 提交于
      With the new RB tree structure for service ranges it becomes possible to
      solve an old problem; - we can now allow overlapping service ranges in
      the table.
      
      When inserting a new service range to the tree, we use 'lower' as primary
      key, and when necessary 'upper' as secondary key.
      
      Since there may now be multiple service ranges matching an indicated
      'lower' value, we must also add the 'upper' value to the functions
      used for removing publications, so that the correct, corresponding
      range item can be found.
      
      These changes guarantee that a well-formed publication/withdrawal item
      from a peer node never will be rejected, and make it possible to
      eliminate the problematic backlog functionality we currently have for
      handling such cases.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37922ea4
  9. 24 3月, 2018 1 次提交
  10. 23 3月, 2018 3 次提交
  11. 18 3月, 2018 2 次提交
    • J
      tipc: some name changes · e50e73e1
      Jon Maloy 提交于
      We rename some lists and fields in struct publication both to make
      the naming more consistent and to better reflect their roles. We
      also update the descriptions of those lists.
      
      node_list -> local_publ
      cluster_list -> all_publ
      pport_list -> binding_sock
      ref -> port
      
      There are no functional changes in this commit.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e50e73e1
    • J
      tipc: obsolete TIPC_ZONE_SCOPE · 928df188
      Jon Maloy 提交于
      Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
      aspects handled the same way, both on the publishing node and on the
      receiving nodes.
      
      Despite previous ambitions to the contrary, this is never going to change,
      so we take the conseqeunce of this and obsolete TIPC_ZONE_SCOPE and related
      macros/functions. Whenever a user is doing a bind() or a sendmsg() attempt
      using ZONE_SCOPE we translate this internally to CLUSTER_SCOPE, while we
      remain compatible with users and remote nodes still using ZONE_SCOPE.
      
      Furthermore, the non-formalized scope value 0 has always been permitted
      for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
      We now permit it even as binding scope, but for compatibility reasons we
      choose to not change the value of TIPC_CLUSTER_SCOPE.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      928df188
  12. 28 2月, 2018 1 次提交
    • J
      tipc: correct initial value for group congestion flag · 1b22bcad
      Jon Maloy 提交于
      In commit 60c25306 ("tipc: fix race between poll() and
      setsockopt()") we introduced a pointer from struct tipc_group to the
      'group_is_connected' flag in struct tipc_sock, so that this field can
      be checked without dereferencing the group pointer of the latter struct.
      
      The initial value for this flag is correctly set to 'false' when a
      group is created, but we miss the case when no group is created at
      all, in which case the initial value should be 'true'. This has the
      effect that SOCK_RDM/DGRAM sockets sending datagrams never receive
      POLLOUT if they request so.
      
      This commit corrects this bug.
      
      Fixes: 60c25306 ("tipc: fix race between poll() and setsockopt()")
      Reported-by: NHoang Le <hoang.h.le@dektek.com.au>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b22bcad
  13. 13 2月, 2018 1 次提交
    • D
      net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko 提交于
      Changes since v1:
      Added changes in these files:
          drivers/infiniband/hw/usnic/usnic_transport.c
          drivers/staging/lustre/lnet/lnet/lib-socket.c
          drivers/target/iscsi/iscsi_target_login.c
          drivers/vhost/net.c
          fs/dlm/lowcomms.c
          fs/ocfs2/cluster/tcp.c
          security/tomoyo/network.c
      
      Before:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      
      "int *socklen" parameter is awkward. For example, if caller does not
      care, it still needs to provide on-stack storage for the value
      it does not need.
      
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      
      This change drops this parameter, and makes all these functions, on success,
      return length of sockaddr. It's always >= 0 and can be differentiated
      from an error.
      
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      
      Userspace API is not changed.
      
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b2c45d4
  14. 12 2月, 2018 1 次提交
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds 提交于
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  15. 20 1月, 2018 1 次提交
    • J
      tipc: fix race between poll() and setsockopt() · 60c25306
      Jon Maloy 提交于
      Letting tipc_poll() dereference a socket's pointer to struct tipc_group
      entails a race risk, as the group item may be deleted in a concurrent
      tipc_sk_join() or tipc_sk_leave() thread.
      
      We now move the 'open' flag in struct tipc_group to struct tipc_sock,
      and let the former retain only a pointer to the moved field. This will
      eliminate the race risk.
      
      Reported-by: syzbot+799dafde0286795858ac@syzkaller.appspotmail.com
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60c25306
  16. 16 1月, 2018 2 次提交
    • J
      tipc: fix bug during lookup of multicast destination nodes · e9a03445
      Jon Maloy 提交于
      In commit 232d07b7 ("tipc: improve groupcast scope handling") we
      inadvertently broke non-group multicast transmission when changing the
      parameter 'domain' to 'scope' in the function
      tipc_nametbl_lookup_dst_nodes(). We missed to make the corresponding
      change in the calling function, with the result that the lookup always
      fails.
      
      A closer anaysis reveals that this parameter is not needed at all.
      Non-group multicast is hard coded to use CLUSTER_SCOPE, and in the
      current implementation this will be delivered to all matching
      destinations except those which are published with NODE_SCOPE on other
      nodes. Since such publications never will be visible on the sending node
      anyway, it makes no sense to discriminate by scope at all.
      
      We now remove this parameter altogether.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9a03445
    • J
      tipc: fix a potental access after delete in tipc_sk_join() · febafc84
      Jon Maloy 提交于
      In commit d12d2e12 "tipc: send out join messages as soon as new
      member is discovered") we added a call to the function tipc_group_join()
      without considering the case that the preceding tipc_sk_publish() might
      have failed, and the group item already deleted.
      
      We fix this by returning from tipc_sk_join() directly after the
      failed tipc_sk_publish.
      
      Reported-by: syzbot+e3eeae78ea88b8d6d858@syzkaller.appspotmail.com
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      febafc84
  17. 10 1月, 2018 4 次提交
    • J
      tipc: improve poll() for group member socket · eb929a91
      Jon Maloy 提交于
      The current criteria for returning POLLOUT from a group member socket is
      too simplistic. It basically returns POLLOUT as soon as the group has
      external destinations, something obviously leading to a lot of spinning
      during destination congestion situations. At the same time, the internal
      congestion handling is unnecessarily complex.
      
      We now change this as follows.
      
      - We introduce an 'open' flag in  struct tipc_group. This flag is used
        only to help poll() get the setting of POLLOUT right, and *not* for
        congeston handling as such. This means that a user can choose to
        ignore an  EAGAIN for a destination and go on sending messages to
        other destinations in the group if he wants to.
      
      - The flag is set to false every time we return EAGAIN on a send call.
      
      - The flag is set to true every time any member, i.e., not necessarily
        the member that caused EAGAIN, is removed from the small_win list.
      
      - We remove the group member 'usr_pending' flag. The size of the send
        window and presence in the 'small_win' list is sufficient criteria
        for recognizing congestion.
      
      This solution seems to be a reasonable compromise between 'anycast',
      which is normally not waiting for POLLOUT for a specific destination,
      and the other three send modes, which are.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb929a91
    • J
      tipc: improve groupcast scope handling · 232d07b7
      Jon Maloy 提交于
      When a member joins a group, it also indicates a binding scope. This
      makes it possible to create both node local groups, invisible to other
      nodes, as well as cluster global groups, visible everywhere.
      
      In order to avoid that different members end up having permanently
      differing views of group size and memberhip, we must inhibit locally
      and globally bound members from joining the same group.
      
      We do this by using the binding scope as an additional separator between
      groups. I.e., a member must ignore all membership events from sockets
      using a different scope than itself, and all lookups for message
      destinations must require an exact match between the message's lookup
      scope and the potential target's binding scope.
      
      Apart from making it possible to create local groups using the same
      identity on different nodes, a side effect of this is that it now also
      becomes possible to create a cluster global group with the same identity
      across the same nodes, without interfering with the local groups.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      232d07b7
    • J
      tipc: send out join messages as soon as new member is discovered · d12d2e12
      Jon Maloy 提交于
      When a socket is joining a group, we look up in the binding table to
      find if there are already other members of the group present. This is
      used for being able to return EAGAIN instead of EHOSTUNREACH if the
      user proceeds directly to a send attempt.
      
      However, the information in the binding table can be used to directly
      set the created member in state MBR_PUBLISHED and send a JOIN message
      to the peer, instead of waiting for a topology PUBLISH event to do this.
      When there are many members in a group, the propagation time for such
      events can be significant, and we can save time during the join
      operation if we use the initial lookup result fully.
      
      In this commit, we eliminate the member state MBR_DISCOVERED which has
      been the result of the initial lookup, and do instead go directly to
      MBR_PUBLISHED, which initiates the setup.
      
      After this change, the tipc_member FSM looks as follows:
      
           +-----------+
      ---->| PUBLISHED |-----------------------------------------------+
      PUB- +-----------+                                 LEAVE/WITHRAW |
      LISH       |JOIN                                                 |
                 |     +-------------------------------------------+   |
                 |     |                            LEAVE/WITHDRAW |   |
                 |     |                +------------+             |   |
                 |     |   +----------->|  PENDING   |---------+   |   |
                 |     |   |msg/maxactv +-+---+------+  LEAVE/ |   |   |
                 |     |   |              |   |       WITHDRAW |   |   |
                 |     |   |   +----------+   |                |   |   |
                 |     |   |   |revert/maxactv|                |   |   |
                 |     |   |   V              V                V   V   V
                 |   +----------+  msg  +------------+       +-----------+
                 +-->|  JOINED  |------>|   ACTIVE   |------>|  LEAVING  |--->
                 |   +----------+       +--- -+------+ LEAVE/+-----------+DOWN
                 |        A   A               |      WITHDRAW A   A    A   EVT
                 |        |   |               |RECLAIM        |   |    |
                 |        |   |REMIT          V               |   |    |
                 |        |   |== adv   +------------+        |   |    |
                 |        |   +---------| RECLAIMING |--------+   |    |
                 |        |             +-----+------+  LEAVE/    |    |
                 |        |                   |REMIT   WITHDRAW   |    |
                 |        |                   |< adv              |    |
                 |        |msg/               V            LEAVE/ |    |
                 |        |adv==ADV_IDLE+------------+   WITHDRAW |    |
                 |        +-------------|  REMITTED  |------------+    |
                 |                      +------------+                 |
                 |PUBLISH                                              |
      JOIN +-----------+                                LEAVE/WITHDRAW |
      ---->|  JOINING  |-----------------------------------------------+
           +-----------+
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d12d2e12
    • J
      tipc: create group member event messages when they are needed · 7ad32bcb
      Jon Maloy 提交于
      In the current implementation, a group socket receiving topology
      events about other members just converts the topology event message
      into a group event message and stores it until it reaches the right
      state to issue it to the user. This complicates the code unnecessarily,
      and becomes impractical when we in the coming commits will need to
      create and issue membership events independently.
      
      In this commit, we change this so that we just notice the type and
      origin of the incoming topology event, and then drop the buffer. Only
      when it is time to actually send a group event to the user do we
      explicitly create a new message and send it upwards.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ad32bcb
  18. 29 12月, 2017 1 次提交
  19. 14 12月, 2017 1 次提交
  20. 11 12月, 2017 1 次提交
    • T
      rhashtable: Change rhashtable_walk_start to return void · 97a6ec4a
      Tom Herbert 提交于
      Most callers of rhashtable_walk_start don't care about a resize event
      which is indicated by a return value of -EAGAIN. So calls to
      rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
      like this is common:
      
             ret = rhashtable_walk_start(rhiter);
             if (ret && ret != -EAGAIN)
                     goto out;
      
      Since zero and -EAGAIN are the only possible return values from the
      function this check is pointless. The condition never evaluates to true.
      
      This patch changes rhashtable_walk_start to return void. This simplifies
      code for the callers that ignore -EAGAIN. For the few cases where the
      caller cares about the resize event, particularly where the table can be
      walked in mulitple parts for netlink or seq file dump, the function
      rhashtable_walk_start_check has been added that returns -EAGAIN on a
      resize event.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97a6ec4a
  21. 28 11月, 2017 1 次提交
  22. 01 11月, 2017 1 次提交
  23. 26 10月, 2017 1 次提交
  24. 22 10月, 2017 1 次提交
  25. 21 10月, 2017 1 次提交
  26. 13 10月, 2017 8 次提交
    • J
      tipc: guarantee delivery of UP event before first broadcast · 399574d4
      Jon Maloy 提交于
      The following scenario is possible:
      - A user joins a group, and immediately sends out a broadcast message
        to its members.
      - The broadcast message, following a different data path than the
        initial JOIN message sent out during the joining procedure, arrives
        to a receiver before the latter..
      - The receiver drops the message, since it is not ready to accept any
        messages until the JOIN has arrived.
      
      We avoid this by treating group protocol JOIN messages like unicast
      messages.
      - We let them pass through the recipient's multicast input queue, just
        like ordinary unicasts.
      - We force the first following broadacst to be sent as replicated
        unicast and being acknowledged by the recipient before accepting
        any more broadcast transmissions.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      399574d4
    • J
      tipc: guarantee that group broadcast doesn't bypass group unicast · 2f487712
      Jon Maloy 提交于
      We need a mechanism guaranteeing that group unicasts sent out from a
      socket are not bypassed by later sent broadcasts from the same socket.
      We do this as follows:
      
      - Each time a unicast is sent, we set a the broadcast method for the
        socket to "replicast" and "mandatory". This forces the first
        subsequent broadcast message to follow the same network and data path
        as the preceding unicast to a destination, hence preventing it from
        overtaking the latter.
      
      - In order to make the 'same data path' statement above true, we let
        group unicasts pass through the multicast link input queue, instead
        of as previously through the unicast link input queue.
      
      - In the first broadcast following a unicast, we set a new header flag,
        requiring all recipients to immediately acknowledge its reception.
      
      - During the period before all the expected acknowledges are received,
        the socket refuses to accept any more broadcast attempts, i.e., by
        blocking or returning EAGAIN. This period should typically not be
        longer than a few microseconds.
      
      - When all acknowledges have been received, the sending socket will
        open up for subsequent broadcasts, this time giving the link layer
        freedom to itself select the best transmission method.
      
      - The forced and/or abrupt transmission method changes described above
        may lead to broadcasts arriving out of order to the recipients. We
        remedy this by introducing code that checks and if necessary
        re-orders such messages at the receiving end.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f487712
    • J
      tipc: guarantee group unicast doesn't bypass group broadcast · b87a5ea3
      Jon Maloy 提交于
      Group unicast messages don't follow the same path as broadcast messages,
      and there is a high risk that unicasts sent from a socket might bypass
      previously sent broadcasts from the same socket.
      
      We fix this by letting all unicast messages carry the sequence number of
      the next sent broadcast from the same node, but without updating this
      number at the receiver. This way, a receiver can check and if necessary
      re-order such messages before they are added to the socket receive buffer.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b87a5ea3
    • J
      tipc: introduce group multicast messaging · 5b8dddb6
      Jon Maloy 提交于
      The previously introduced message transport to all group members is
      based on the tipc multicast service, but is logically a broadcast
      service within the group, and that is what we call it.
      
      We now add functionality for sending messages to all group members
      having a certain identity. Correspondingly, we call this feature 'group
      multicast'. The service is using unicast when only one destination is
      found, otherwise it will use the bearer broadcast service to transfer
      the messages. In the latter case, the receiving members filter arriving
      messages by looking at the intended destination instance. If there is
      no match, the message will be dropped, while still being considered
      received and read as seen by the flow control mechanism.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b8dddb6
    • J
      tipc: introduce group anycast messaging · ee106d7f
      Jon Maloy 提交于
      In this commit, we make it possible to send connectionless unicast
      messages to any member corresponding to the given member identity,
      when there is more than one such member. The sender must use a
      TIPC_ADDR_NAME address to achieve this effect.
      
      We also perform load balancing between the destinations, i.e., we
      primarily select one which has advertised sufficient send window
      to not cause a block/EAGAIN delay, if any. This mechanism is
      overlayed on the always present round-robin selection.
      
      Anycast messages are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee106d7f
    • J
      tipc: introduce group unicast messaging · 27bd9ec0
      Jon Maloy 提交于
      We now make it possible to send connectionless unicast messages
      within a communication group. To send a message, the sender can use
      either a direct port address, aka port identity, or an indirect port
      name to be looked up.
      
      This type of messages are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27bd9ec0
    • J
      tipc: introduce flow control for group broadcast messages · b7d42635
      Jon Maloy 提交于
      We introduce an end-to-end flow control mechanism for group broadcast
      messages. This ensures that no messages are ever lost because of
      destination receive buffer overflow, with minimal impact on performance.
      For now, the algorithm is based on the assumption that there is only one
      active transmitter at any moment in time.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7d42635
    • J
      tipc: receive group membership events via member socket · ae236fb2
      Jon Maloy 提交于
      Like with any other service, group members' availability can be
      subscribed for by connecting to be topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events my arrive out of synch with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member synch-
      ronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae236fb2