1. 04 5月, 2016 1 次提交
    • J
      tipc: redesign connection-level flow control · 10724cc7
      Jon Paul Maloy 提交于
      There are two flow control mechanisms in TIPC; one at link level that
      handles network congestion, burst control, and retransmission, and one
      at connection level which' only remaining task is to prevent overflow
      in the receiving socket buffer. In TIPC, the latter task has to be
      solved end-to-end because messages can not be thrown away once they
      have been accepted and delivered upwards from the link layer, i.e, we
      can never permit the receive buffer to overflow.
      
      Currently, this algorithm is message based. A counter in the receiving
      socket keeps track of number of consumed messages, and sends a dedicated
      acknowledge message back to the sender for each 256 consumed message.
      A counter at the sending end keeps track of the sent, not yet
      acknowledged messages, and blocks the sender if this number ever reaches
      512 unacknowledged messages. When the missing acknowledge arrives, the
      socket is then woken up for renewed transmission. This works well for
      keeping the message flow running, as it almost never happens that a
      sender socket is blocked this way.
      
      A problem with the current mechanism is that it potentially is very
      memory consuming. Since we don't distinguish between small and large
      messages, we have to dimension the socket receive buffer according
      to a worst-case of both. I.e., the window size must be chosen large
      enough to sustain a reasonable throughput even for the smallest
      messages, while we must still consider a scenario where all messages
      are of maximum size. Hence, the current fix window size of 512 messages
      and a maximum message size of 66k results in a receive buffer of 66 MB
      when truesize(66k) = 131k is taken into account. It is possible to do
      much better.
      
      This commit introduces an algorithm where we instead use 1024-byte
      blocks as base unit. This unit, always rounded upwards from the
      actual message size, is used when we advertise windows as well as when
      we count and acknowledge transmitted data. The advertised window is
      based on the configured receive buffer size in such a way that even
      the worst-case truesize/msgsize ratio always is covered. Since the
      smallest possible message size (from a flow control viewpoint) now is
      1024 bytes, we can safely assume this ratio to be less than four, which
      is the value we are now using.
      
      This way, we have been able to reduce the default receive buffer size
      from 66 MB to 2 MB with maintained performance.
      
      In order to keep this solution backwards compatible, we introduce a
      new capability bit in the discovery protocol, and use this throughout
      the message sending/reception path to always select the right unit.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10724cc7
  2. 27 7月, 2015 1 次提交
    • J
      tipc: clean up socket layer message reception · cda3696d
      Jon Paul Maloy 提交于
      When a message is received in a socket, one of the call chains
      tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
      or
      tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
      are followed. At each of these levels we may encounter situations
      where the message may need to be rejected, or a new message
      produced for transfer back to the sender. Despite recent
      improvements, the current code for doing this is perceived
      as awkward and hard to follow.
      
      Leveraging the two previous commits in this series, we now
      introduce a more uniform handling of such situations. We
      let each of the functions in the chain itself produce/reverse
      the message to be returned to the sender, but also perform the
      actual forwarding. This simplifies the necessary logics within
      each function.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cda3696d
  3. 18 3月, 2015 1 次提交
    • Y
      tipc: fix netns refcnt leak · 76100a8a
      Ying Xue 提交于
      When the TIPC module is loaded, we launch a topology server in kernel
      space, which in its turn is creating TIPC sockets for communication
      with topology server users. Because both the socket's creator and
      provider reside in the same module, it is necessary that the TIPC
      module's reference count remains zero after the server is started and
      the socket created; otherwise it becomes impossible to perform "rmmod"
      even on an idle module.
      
      Currently, we achieve this by defining a separate "tipc_proto_kern"
      protocol struct, that is used only for kernel space socket allocations.
      This structure has the "owner" field set to NULL, which restricts the
      module reference count from being be bumped when sk_alloc() for local
      sockets is called. Furthermore, we have defined three kernel-specific
      functions, tipc_sock_create_local(), tipc_sock_release_local() and
      tipc_sock_accept_local(), to avoid the module counter being modified
      when module local sockets are created or deleted. This has worked well
      until we introduced name space support.
      
      However, after name space support was introduced, we have observed that
      a reference count leak occurs, because the netns counter is not
      decremented in tipc_sock_delete_local().
      
      This commit remedies this problem. But instead of just modifying
      tipc_sock_delete_local(), we eliminate the whole parallel socket
      handling infrastructure, and start using the regular sk_create_kern(),
      kernel_accept() and sk_release_kernel() calls. Since those functions
      manipulate the module counter, we must now compensate for that by
      explicitly decrementing the counter after module local sockets are
      created, and increment it just before calling sk_release_kernel().
      
      Fixes: a62fbcce ("tipc: make subscriber server support net namespace")
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NJon Maloy <jon.maloy@ericson.com>
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reported-by: NCong Wang <cwang@twopensource.com>
      Tested-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76100a8a
  4. 10 2月, 2015 1 次提交
  5. 06 2月, 2015 3 次提交
    • J
      tipc: eliminate race condition at multicast reception · cb1b7280
      Jon Paul Maloy 提交于
      In a previous commit in this series we resolved a race problem during
      unicast message reception.
      
      Here, we resolve the same problem at multicast reception. We apply the
      same technique: an input queue serializing the delivery of arriving
      buffers. The main difference is that here we do it in two steps.
      First, the broadcast link feeds arriving buffers into the tail of an
      arrival queue, which head is consumed at the socket level, and where
      destination lookup is performed. Second, if the lookup is successful,
      the resulting buffer clones are fed into a second queue, the input
      queue. This queue is consumed at reception in the socket just like
      in the unicast case. Both queues are protected by the same lock, -the
      one of the input queue.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb1b7280
    • J
      tipc: simplify socket multicast reception · 3c724acd
      Jon Paul Maloy 提交于
      The structure 'tipc_port_list' is used to collect port numbers
      representing multicast destination socket on a receiving node.
      The list is not based on a standard linked list, and is in reality
      optimized for the uncommon case that there are more than one
      multicast destinations per node. This makes the list handling
      unecessarily complex, and as a consequence, even the socket
      multicast reception becomes more complex.
      
      In this commit, we replace 'tipc_port_list' with a new 'struct
      tipc_plist', which is based on a standard list. We give the new
      list stack (push/pop) semantics, someting that simplifies
      the implementation of the function tipc_sk_mcast_rcv().
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c724acd
    • J
      tipc: resolve race problem at unicast message reception · c637c103
      Jon Paul Maloy 提交于
      TIPC handles message cardinality and sequencing at the link layer,
      before passing messages upwards to the destination sockets. During the
      upcall from link to socket no locks are held. It is therefore possible,
      and we see it happen occasionally, that messages arriving in different
      threads and delivered in sequence still bypass each other before they
      reach the destination socket. This must not happen, since it violates
      the sequentiality guarantee.
      
      We solve this by adding a new input buffer queue to the link structure.
      Arriving messages are added safely to the tail of that queue by the
      link, while the head of the queue is consumed, also safely, by the
      receiving socket. Sequentiality is secured per socket by only allowing
      buffers to be dequeued inside the socket lock. Since there may be multiple
      simultaneous readers of the queue, we use a 'filter' parameter to reduce
      the risk that they peek the same buffer from the queue, hence also
      reducing the risk of contention on the receiving socket locks.
      
      This solves the sequentiality problem, and seems to cause no measurable
      performance degradation.
      
      A nice side effect of this change is that lock handling in the functions
      tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
      will enable future simplifications of those functions.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c637c103
  6. 13 1月, 2015 4 次提交
  7. 09 1月, 2015 1 次提交
    • Y
      tipc: convert tipc reference table to use generic rhashtable · 07f6c4bc
      Ying Xue 提交于
      As tipc reference table is statically allocated, its memory size
      requested on stack initialization stage is quite big even if the
      maximum port number is just restricted to 8191 currently, however,
      the number already becomes insufficient in practice. But if the
      maximum ports is allowed to its theory value - 2^32, its consumed
      memory size will reach a ridiculously unacceptable value. Apart from
      this, heavy tipc users spend a considerable amount of time in
      tipc_sk_get() due to the read-lock on ref_table_lock.
      
      If tipc reference table is converted with generic rhashtable, above
      mentioned both disadvantages would be resolved respectively: making
      use of the new resizable hash table can avoid locking on the lookup;
      smaller memory size is required at initial stage, for example, 256
      hash bucket slots are requested at the beginning phase instead of
      allocating the entire 8191 slots in old mode. The hash table will
      grow if entries exceeds 75% of table size up to a total table size
      of 1M, and it will automatically shrink if usage falls below 30%,
      but the minimum table size is allowed down to 256.
      
      Also converts ref_table_lock to a separate mutex to protect hash table
      mutations on write side. Lastly defers the release of the socket
      reference using call_rcu() to allow using an RCU read-side protected
      call to rhashtable_lookup().
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NErik Hugne <erik.hugne@ericsson.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07f6c4bc
  8. 22 11月, 2014 2 次提交
  9. 24 8月, 2014 5 次提交
  10. 17 7月, 2014 1 次提交
  11. 28 6月, 2014 2 次提交
    • J
      tipc: simplify connection congestion handling · 60120526
      Jon Paul Maloy 提交于
      As a consequence of the recently introduced serialized access
      to the socket in commit 8d94168a761819d10252bab1f8de6d7b202c3baa
      ("tipc: same receive code path for connection protocol and data
      messages") we can make a number of simplifications in the
      detection and handling of connection congestion situations.
      
      - We don't need to keep two counters, one for sent messages and one
        for acked messages. There is no longer any risk for races between
        acknowledge messages arriving in BH and data message sending
        running in user context. So we merge this into one counter,
        'sent_unacked', which is incremented at sending and subtracted
        from at acknowledge reception.
      
      - We don't need to set the 'congested' field in tipc_port to
        true before we sent the message, and clear it when sending
        is successful. (As a matter of fact, it was never necessary;
        the field was set in link_schedule_port() before any wakeup
        could arrive anyway.)
      
      - We keep the conditions for link congestion and connection connection
        congestion separated. There would otherwise be a risk that an arriving
        acknowledge message may wake up a user sleeping because of link
        congestion.
      
      - We can simplify reception of acknowledge messages.
      
      We also make some cosmetic/structural changes:
      
      - We rename the 'congested' field to the more correct 'link_cong´.
      
      - We rename 'conn_unacked' to 'rcv_unacked'
      
      - We move the above mentioned fields from struct tipc_port to
        struct tipc_sock.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60120526
    • J
      tipc: clean up connection protocol reception function · ac0074ee
      Jon Paul Maloy 提交于
      We simplify the code for receiving connection probes, leveraging the
      recently introduced tipc_msg_reverse() function. We also stick to
      the principle of sending a possible response message directly from
      the calling (tipc_sk_rcv or backlog_rcv) functions, hence making
      the call chain shallower and easier to follow.
      
      We make one small protocol change here, allowed according to
      the spec. If a protocol message arrives from a remote socket that
      is not the one we are connected to, we are currently generating a
      connection abort message and send it to the source. This behavior
      is unnecessary, and might even be a security risk, so instead we
      now choose to only ignore the message. The consequnce for the sender
      is that he will need longer time to discover his mistake (until the
      next timeout), but this is an extreme corner case, and may happen
      anyway under other circumstances, so we deem this change acceptable.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac0074ee
  12. 15 5月, 2014 2 次提交
    • J
      tipc: merge port message reception into socket reception function · 9816f061
      Jon Paul Maloy 提交于
      In order to reduce complexity and save a call level during message
      reception at port/socket level, we remove the function tipc_port_rcv()
      and merge its functionality into tipc_sk_rcv().
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9816f061
    • J
      tipc: compensate for double accounting in socket rcv buffer · 4f4482dc
      Jon Paul Maloy 提交于
      The function net/core/sock.c::__release_sock() runs a tight loop
      to move buffers from the socket backlog queue to the receive queue.
      
      As a security measure, sk_backlog.len of the receiving socket
      is not set to zero until after the loop is finished, i.e., until
      the whole backlog queue has been transferred to the receive queue.
      During this transfer, the data that has already been moved is counted
      both in the backlog queue and the receive queue, hence giving an
      incorrect picture of the available queue space for new arriving buffers.
      
      This leads to unnecessary rejection of buffers by sk_add_backlog(),
      which in TIPC leads to unnecessarily broken connections.
      
      In this commit, we compensate for this double accounting by adding
      a counter that keeps track of it. The function socket.c::backlog_rcv()
      receives buffers one by one from __release_sock(), and adds them to the
      socket receive queue. If the transfer is successful, it increases a new
      atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
      transferred buffer. If a new buffer arrives during this transfer and
      finds the socket busy (owned), we attempt to add it to the backlog.
      However, when sk_add_backlog() is called, we adjust the 'limit'
      parameter with the value of the new counter, so that the risk of
      inadvertent rejection is eliminated.
      
      It should be noted that this change does not invalidate the original
      purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
      upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
      doesn't respect the send window) keeps pumping in buffers to
      sk_add_backlog(), he will eventually reach an upper limit,
      (2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
      to the backlog, and the connection will be broken. Ordinary, well-
      behaved senders will never reach this buffer limit at all.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f4482dc
  13. 13 3月, 2014 3 次提交
  14. 18 6月, 2013 1 次提交
    • Y
      tipc: change socket buffer overflow control to respect sk_rcvbuf · cc79dd1b
      Ying Xue 提交于
      As per feedback from the netdev community, we change the buffer
      overflow protection algorithm in receiving sockets so that it
      always respects the nominal upper limit set in sk_rcvbuf.
      
      Instead of scaling up from a small sk_rcvbuf value, which leads to
      violation of the configured sk_rcvbuf limit, we now calculate the
      weighted per-message limit by scaling down from a much bigger value,
      still in the same field, according to the importance priority of the
      received message.
      
      To allow for administrative tunability of the socket receive buffer
      size, we create a tipc_rmem sysctl variable to allow the user to
      configure an even bigger value via sysctl command.  It is a size of
      three (min/default/max) to be consistent with things like tcp_rmem.
      
      By default, the value initialized in tipc_rmem[1] is equal to the
      receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
      This value is also set as the default value of sk_rcvbuf.
      Originally-by: NJon Maloy <jon.maloy@ericsson.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      [Ying: added sysctl variation to Jon's original patch]
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      [PG: don't compile sysctl.c if not config'd; add Documentation]
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc79dd1b
  15. 14 7月, 2012 3 次提交
    • E
      tipc: remove print_buf and deprecated log buffer code · 869dd466
      Erik Hugne 提交于
      The internal log buffer handling functions can now safely be
      removed since there is no code using it anymore.  Requests to
      interact with the internal tipc log buffer over netlink (in
      config.c) will report 'obsolete command'.
      
      This represents the final removal of any references to a
      struct print_buf, and the removal of the struct itself.
      We also get rid of a TIPC specific Kconfig in the process.
      
      Finally, log.h is removed since it is not needed anymore.
      Signed-off-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      869dd466
    • E
      tipc: phase out most of the struct print_buf usage · dc1aed37
      Erik Hugne 提交于
      The tipc_printf is renamed to tipc_snprintf, as the new name
      describes more what the function actually does.  It is also
      changed to take a buffer and length parameter and return
      number of characters written to the buffer.  All callers of
      this function that used to pass a print_buf are updated.
      
      Final removal of the struct print_buf itself will be done
      synchronously with the pending removal of the deprecated
      logging code that also was using it.
      
      Functions that build up a response message with a list of
      ports, nametable contents etc. are changed to return the number
      of characters written to the output buffer. This information
      was previously hidden in a field of the print_buf struct, and
      the number of chars written was fetched with a call to
      tipc_printbuf_validate.  This function is removed since it
      is no longer referenced nor needed.
      
      A generic max size ULTRA_STRING_MAX_LEN is defined, named
      in keeping with the existing TIPC_TLV_ULTRA_STRING, and the
      various definitions in port, link and nametable code that
      largely duplicated this information are removed.  This means
      that amount of link statistics that can be returned is now
      increased from 2k to 32k.
      
      The buffer overflow check is now done just before the reply
      message is passed over netlink or TIPC to a remote node and
      the message indicating a truncated buffer is changed to a less
      dramatic one (less CAPS), placed at the end of the message.
      Signed-off-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      dc1aed37
    • E
      tipc: simplify print buffer handling in tipc_printf · e2dbd601
      Erik Hugne 提交于
      tipc_printf was previously used both to construct debug traces
      and to append data to buffers that should be sent over netlink
      to the tipc-config application.  A global print_buffer was
      used to format the string before it was copied to the actual
      output buffer.  This could lead to concurrent access of the
      global print_buffer, which then had to be lock protected.
      This is simplified by changing tipc_printf to append data
      directly to the output buffer using vscnprintf.
      
      With the new implementation of tipc_printf, there is no longer
      any risk of concurrent access to the internal log buffer, so
      the lock (and the comments describing it) are no longer
      strictly necessary.  However, there are still a few functions
      that do grab this lock before resizing/dumping the log
      buffer.  We leave the lock, and these functions untouched since
      they will be removed with a subsequent commit that drops the
      deprecated log buffer handling code
      Signed-off-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      e2dbd601
  16. 01 5月, 2012 1 次提交
    • P
      tipc: compress out gratuitous extra carriage returns · 617d3c7a
      Paul Gortmaker 提交于
      Some of the comment blocks are floating in limbo between two
      functions, or between blocks of code.  Delete the extra line
      feeds between any comment and its associated following block
      of code, to be consistent with the majority of the rest of
      the kernel.  Also delete trailing newlines at EOF and fix
      a couple trivial typos in existing comments.
      
      This is a 100% cosmetic change with no runtime impact.  We get
      rid of over 500 lines of non-code, and being blank line deletes,
      they won't even show up as noise in git blame.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      617d3c7a
  17. 25 2月, 2012 1 次提交
  18. 02 1月, 2011 3 次提交
  19. 17 10月, 2010 1 次提交
  20. 24 9月, 2010 1 次提交
  21. 19 3月, 2009 1 次提交
  22. 05 5月, 2008 1 次提交