1. 28 Sep, 2018 2 commits
    • rxrpc: Emit BUSY packets when supposed to rather than ABORTs · ece64fec
      Committed by David Howells
      In the input path, a received sk_buff can be marked for rejection by
      setting RXRPC_SKB_MARK_* in skb->mark and, if needed, some auxiliary data
      (such as an abort code) in skb->priority.  The rejection is handled by
      queueing the sk_buff up for dealing with in process context.  The output
      code reads the mark and priority and, theoretically, generates an
      appropriate response packet.
      
      However, if RXRPC_SKB_MARK_BUSY is set, this isn't noticed and an ABORT
      message with a random abort code is generated (since skb->priority wasn't
      set to anything).
      
      Fix this by outputting the appropriate sort of packet.
      
      Also, whilst we're at it, most of the marks are no longer used, so remove
      them and rename the remaining two to something more obvious.
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: David Howells <dhowells@redhat.com>
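The dispatch this fix describes can be sketched in user-space C as follows. This is an illustration under assumptions, not the kernel code: `reject_mark`, `reject_response` and `pick_reject_response()` are hypothetical names, and only the packet type values 3 (BUSY) and 4 (ABORT) come from the Rx wire protocol.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two remaining skb->mark values. */
enum reject_mark { MARK_REJECT_BUSY, MARK_REJECT_ABORT };

/* Rx wire packet types relevant to rejection. */
enum { PKT_TYPE_BUSY = 3, PKT_TYPE_ABORT = 4 };

struct reject_response {
	uint8_t  type;        /* packet type to place in the header */
	int      has_code;    /* nonzero if abort_code is meaningful */
	uint32_t abort_code;  /* taken from the auxiliary data (skb->priority) */
};

/* Choose the response from the mark; only ABORT consumes the auxiliary data. */
static struct reject_response pick_reject_response(enum reject_mark mark,
						   uint32_t aux)
{
	struct reject_response r = { 0, 0, 0 };

	switch (mark) {
	case MARK_REJECT_BUSY:
		r.type = PKT_TYPE_BUSY;   /* BUSY carries no abort code */
		break;
	case MARK_REJECT_ABORT:
		r.type = PKT_TYPE_ABORT;
		r.has_code = 1;
		r.abort_code = aux;
		break;
	}
	return r;
}
```

Keeping the abort code out of the BUSY path means an unset auxiliary value can never leak onto the wire.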
    • rxrpc: Fix RTT gathering · b604dd98
      Committed by David Howells
      Fix RTT information gathering in AF_RXRPC by the following means:
      
       (1) Enable Rx timestamping on the transport socket with SO_TIMESTAMPNS.
      
       (2) If the sk_buff doesn't have a timestamp set when rxrpc_data_ready()
           collects it, set it at that point.
      
       (3) Allow ACKs to be requested on the last packet of a client call, but
           not a service call.  We need to be careful lest we undo:
      
      	bf7d620a
      	Author: David Howells <dhowells@redhat.com>
      	Date:   Thu Oct 6 08:11:51 2016 +0100
      	rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase
      
           but that only really applies to service calls that we're handling,
           since the client side gets to send the final ACK (or not).
      
       (4) When about to transmit an ACK or DATA packet, record the Tx timestamp
           before only; don't update the timestamp afterwards.
      
       (5) Switch the ordering between recording the serial and recording the
           timestamp to always set the serial number first.  The serial number
           shouldn't be seen referenced by an ACK packet until we've transmitted
           the packet bearing it - so in the Rx path, we don't need the timestamp
           until we've checked the serial number.
      
      Fixes: cf1a6474 ("rxrpc: Add per-peer RTT tracker")
      Signed-off-by: David Howells <dhowells@redhat.com>
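Points (4) and (5) can be illustrated with a small single-threaded sketch. The names `tx_record`, `record_tx()` and `rtt_sample_ns()` are hypothetical, and the memory barriers the kernel would need between the two stores are deliberately omitted here.

```c
#include <stdint.h>

/* Hypothetical record of one transmitted packet awaiting an ACK. */
struct tx_record {
	uint32_t serial;     /* serial number, stored first */
	int64_t  sent_at_ns; /* timestamp taken just before transmission */
};

/* Record the serial first, then the pre-transmission timestamp. */
static void record_tx(struct tx_record *rec, uint32_t serial, int64_t now_ns)
{
	rec->serial = serial;
	rec->sent_at_ns = now_ns;
}

/*
 * On receiving an ACK, check the serial before touching the timestamp.
 * Returns the RTT sample in nanoseconds, or -1 if the ACK does not
 * reference this packet.
 */
static int64_t rtt_sample_ns(const struct tx_record *rec,
			     uint32_t acked_serial, int64_t rx_time_ns)
{
	if (rec->serial != acked_serial)
		return -1;
	return rx_time_ns - rec->sent_at_ns;
}
```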
  2. 09 Aug, 2018 1 commit
    • rxrpc: Fix the keepalive generator [ver #2] · 330bdcfa
      Committed by David Howells
      AF_RXRPC has a keepalive message generator that generates a message for a
      peer ~20s after the last transmission to that peer to keep firewall ports
      open.  The implementation is incorrect in the following ways:
      
       (1) It mixes up ktime_t and time64_t types.
      
       (2) It uses ktime_get_real(), the output of which may jump forward or
           backward due to adjustments to the time of day.
      
       (3) If the current time jumps forward too much or jumps backwards, the
           generator function will crank the base of the time ring round one slot
           at a time (ie. a 1s period) until it catches up, spewing out VERSION
           packets as it goes.
      
      Fix the problem by:
      
       (1) Only using time64_t.  There's no need for sub-second resolution.
      
       (2) Use ktime_get_seconds() rather than ktime_get_real() so that time
           isn't perceived to go backwards.
      
       (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
           parts:
      
           (a) The "worker" function that manages the buckets and the timer.
      
           (b) The "dispatch" function that takes the pending peers and
               potentially transmits a keepalive packet before putting them back
               in the ring into the slot appropriate to the revised last-Tx time.
      
       (4) Taking everything that's pending out of the ring and splicing it into
           a temporary collector list for processing.
      
           In the case that there's been a significant jump forward, the ring
           gets entirely emptied and then the time base can be warped forward
           before the peers are processed.
      
           The warping can't happen if the ring isn't empty because the slot a
           peer is in is keepalive-time dependent, relative to the base time.
      
       (5) Limit the number of iterations of the bucket array when scanning it.
      
       (6) Set the timer to skip any empty slots as there's no point waking up if
           there's nothing to do yet.
      
      This can be triggered by an incoming call from a server after a reboot with
      AF_RXRPC and AFS built into the kernel causing a peer record to be set up
      before userspace is started.  The system clock is then adjusted by
      userspace, thereby potentially causing the keepalive generator to have a
      meltdown - which leads to a message like:
      
      	watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
      	...
      	Workqueue: krxrpcd rxrpc_peer_keepalive_worker
      	EIP: lock_acquire+0x69/0x80
      	...
      	Call Trace:
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? _raw_spin_lock_bh+0x29/0x60
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? __lock_acquire+0x3d3/0x870
      	 ? process_one_work+0x110/0x340
      	 ? process_one_work+0x166/0x340
      	 ? process_one_work+0x110/0x340
      	 ? worker_thread+0x39/0x3c0
      	 ? kthread+0xdb/0x110
      	 ? cancel_delayed_work+0x90/0x90
      	 ? kthread_stop+0x70/0x70
      	 ? ret_from_fork+0x19/0x24
      
      Fixes: ace45bec ("rxrpc: Fix firewall route keepalive")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
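The slot arithmetic behind the time ring can be sketched like this. It is a simplified illustration, not the kernel implementation: `keepalive_slot()` and `warp_base()` are hypothetical helpers, times are whole seconds as with time64_t, and the 20-second period is taken from the description above.

```c
#include <stdint.h>

#define KEEPALIVE_TIME 20   /* seconds, per the ~20s figure above */

/*
 * Return how many one-second slots ahead of the ring's base a peer should
 * be filed.  Sets *send_now if the keepalive is already due, or if the
 * computed time falls outside the ring (for example after a clock jump).
 */
static unsigned int keepalive_slot(int64_t base, int64_t last_tx_at,
				   int *send_now)
{
	int64_t keepalive_at = last_tx_at + KEEPALIVE_TIME;

	if (keepalive_at <= base || keepalive_at > base + KEEPALIVE_TIME) {
		*send_now = 1;
		return KEEPALIVE_TIME;   /* re-file a full period ahead */
	}
	*send_now = 0;
	return (unsigned int)(keepalive_at - base);
}

/* Once the ring has been emptied, the base can jump straight to "now". */
static int64_t warp_base(int64_t base, int64_t now, int ring_is_empty)
{
	return (ring_is_empty && now > base) ? now : base;
}
```

Warping is only allowed on an empty ring because each peer's slot is stored relative to the base; moving the base under a populated ring would shift every pending keepalive.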
  3. 01 Aug, 2018 1 commit
    • rxrpc: Trace packet transmission · 4764c0da
      Committed by David Howells
      Trace successful packet transmission (kernel_sendmsg() succeeded, that is)
      in AF_RXRPC.  We can share the enum that defines the transmission points
      with the trace_rxrpc_tx_fail() tracepoint, so rename its constants to be
      applicable to both.
      
      Also, save the internal call->debug_id in the rxrpc_channel struct so that
      it can be used in retransmission trace lines.
      Signed-off-by: David Howells <dhowells@redhat.com>
  4. 11 May, 2018 2 commits
    • rxrpc: Trace UDP transmission failure · 6b47fe1d
      Committed by David Howells
      Add a tracepoint to log transmission failure from the UDP transport socket
      being used by AF_RXRPC.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix missing start of call timeout · c54e43d7
      Committed by David Howells
      The expect_rx_by call timeout is supposed to be set when a call is started
      to indicate that we need to receive a packet by that point.  This is
      currently put back every time we receive a packet, but it isn't started
      when we first send a packet.  Without this, the call may wait forever if
      the server doesn't deign to reply.
      
      Fix this by setting the timeout upon a successful UDP sendmsg call for the
      first DATA packet.  The timeout is initiated only for initial transmission
      and not for subsequent retries as we don't want the retry mechanism to
      extend the timeout indefinitely.
      
      Fixes: a158bdd3 ("rxrpc: Fix call timeouts")
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
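The rule for arming the timeout can be sketched as below. The struct and function names are hypothetical, and time is an abstract integer tick count rather than jiffies.

```c
#include <stdint.h>

struct call_timers {
	int64_t expect_rx_by;   /* deadline for hearing from the peer; 0 = unset */
};

/*
 * Called after a DATA packet has been handed to the transport.
 * Only the first, successful, non-retransmitted send arms the deadline,
 * so retries cannot keep pushing it into the future.
 */
static void note_data_sent(struct call_timers *t, int send_ok,
			   int is_first_data, int is_retransmit,
			   int64_t now, int64_t rx_timeout)
{
	if (send_ok && is_first_data && !is_retransmit)
		t->expect_rx_by = now + rx_timeout;
}
```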
  5. 31 Mar, 2018 1 commit
    • rxrpc: Fix firewall route keepalive · ace45bec
      Committed by David Howells
      Fix the firewall route keepalive part of AF_RXRPC, which currently
      functions incorrectly by replying to VERSION REPLY packets from the server
      with VERSION REQUEST packets.
      
      Instead, send VERSION REPLY packets to the peers of service connections to
      act as keep-alives 20s after the latest packet was transmitted to that
      peer.
      
      Also, just discard VERSION REPLY packets rather than replying to them.
      Signed-off-by: David Howells <dhowells@redhat.com>
  6. 23 Feb, 2018 1 commit
  7. 24 Nov, 2017 2 commits
    • rxrpc: Add keepalive for a call · 415f44e4
      Committed by David Howells
      We need to transmit a packet every so often to act as a keepalive for the
      peer (which has a timeout from the last time it received a packet) and also
      to prevent any intervening firewalls from closing the route.
      
      Do this by resetting a timer every time we transmit a packet.  If the timer
      ever expires, we transmit a PING ACK packet and thereby also elicit a PING
      RESPONSE ACK from the other side - which prevents our last-rx timeout from
      expiring.
      
      The timer is set to 1/6 of the last-rx timeout so that we can detect the
      other side going away if it misses 6 replies in a row.
      
      This is particularly necessary for servers where the processing of the
      service function may take a significant amount of time.
      Signed-off-by: David Howells <dhowells@redhat.com>
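The timer arithmetic is small enough to show directly. This is a sketch with hypothetical function names; time units are arbitrary as long as both values use the same one.

```c
#include <stdint.h>

/* Keepalive interval: one sixth of the peer's last-rx timeout. */
static int64_t keepalive_interval(int64_t last_rx_timeout)
{
	return last_rx_timeout / 6;
}

/* Every transmission pushes the next keepalive out by one interval. */
static int64_t next_keepalive_at(int64_t last_tx_at, int64_t last_rx_timeout)
{
	return last_tx_at + keepalive_interval(last_rx_timeout);
}
```

With a 60-unit last-rx timeout, for instance, a ping goes out 10 units after the last transmission, so six unanswered pings fit inside one timeout period.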
    • rxrpc: Add a timeout for detecting lost ACKs/lost DATA · bd1fdf8c
      Committed by David Howells
      Add an extra timeout that is set/updated when we send a DATA packet that
      has the request-ack flag set.  This allows us to detect if we don't get an
      ACK in response to the latest flagged packet.
      
      The ACK packet is adjudged to have been lost if it doesn't turn up within
      2*RTT of the transmission.
      
      If the timeout occurs, we schedule the sending of a PING ACK to find out
      the state of the other side.  If a new DATA packet is ready to go sooner,
      we cancel the sending of the ping and set the request-ack flag on that
      instead.
      
      If we get back a PING-RESPONSE ACK that indicates a lower tx_top than what
      we had at the time of the ping transmission, we adjudge all the DATA
      packets sent between the response tx_top and the ping-time tx_top to have
      been lost and retransmit immediately.
      
      Rather than sending a PING ACK, we could just pick a DATA packet and
      speculatively retransmit that with request-ack set.  It should result in
      either a REQUESTED ACK or a DUPLICATE ACK which we can then use in lieu of
      the PING-RESPONSE ACK mentioned above.
      Signed-off-by: David Howells <dhowells@redhat.com>
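A rough sketch of the two calculations described here: the 2*RTT deadline and the range of DATA packets judged lost. `ack_lost_at()` and `lost_range()` are hypothetical helpers; the signed-difference comparison is an assumption made so that the check survives sequence-number wrap.

```c
#include <stdint.h>

/* Deadline by which the ACK for a flagged DATA packet should arrive. */
static int64_t ack_lost_at(int64_t sent_at, int64_t rtt)
{
	return sent_at + 2 * rtt;
}

/*
 * Given the tx_top recorded when the ping was sent and the tx_top implied
 * by the PING-RESPONSE ACK, return how many DATA packets to retransmit and
 * store the first sequence number in *first.  Returns 0 (leaving *first
 * untouched) if the response is not lower than the ping-time tx_top.
 */
static uint32_t lost_range(uint32_t ping_tx_top, uint32_t resp_tx_top,
			   uint32_t *first)
{
	if ((int32_t)(resp_tx_top - ping_tx_top) >= 0)
		return 0;
	*first = resp_tx_top + 1;
	return ping_tx_top - resp_tx_top;
}
```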
  8. 02 Nov, 2017 2 commits
    • rxrpc: Fix call expiry handling · dcbefc30
      Committed by David Howells
      Fix call expiry handling in the following ways:
      
       (1) If all the request data from a client call is acked, don't send a
           follow up IDLE ACK with firstPacket == 1 and previousPacket == 0 as
           this appears to fool some servers into thinking everything has been
           accepted.
      
       (2) Never send an abort back to the server once it has ACK'd all the
           request packets; rather just try to reuse the channel for the next
           call.  The first request DATA packet of the next call on the same
           channel will implicitly ACK the entire reply of the dead call - even
           if we haven't transmitted it yet.
      
       (3) Don't send RX_CALL_TIMEOUT in an ABORT packet: librx uses abort codes
           to pass local errors to the caller in addition to remote errors, and
           this is meant to be local only.
      
      The following also need to be addressed in future patches:
      
       (4) Service calls should send PING ACKs as 'keep alives' if the server is
           still processing the call.
      
       (5) VERSION REPLY packets should be sent to the peers of service
           connections to act as keep-alives.  This is used to keep firewall
           routes in place.  The AFS CM should enable this.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix a null ptr deref in rxrpc_fill_out_ack() · 1457cc4c
      Committed by David Howells
      rxrpc_fill_out_ack() needs to be passed the connection pointer from its
      caller rather than using call->conn as the call may be disconnected in
      parallel with it, clearing call->conn, leading to:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      	IP: rxrpc_send_ack_packet+0x231/0x6a4
      Signed-off-by: David Howells <dhowells@redhat.com>
  9. 29 Aug, 2017 1 commit
    • rxrpc: Fix IPv6 support · 7b674e39
      Committed by David Howells
      Fix IPv6 support in AF_RXRPC in the following ways:
      
       (1) When extracting the address from a received IPv4 packet, if the local
           transport socket is open for IPv6 then fill out the sockaddr_rxrpc
           struct for an IPv4-mapped-to-IPv6 AF_INET6 transport address instead
           of an AF_INET one.
      
       (2) When sending CHALLENGE or RESPONSE packets, the transport length needs
           to be set from the sockaddr_rxrpc::transport_len field rather than
           sizeof() on the IPv4 transport address.
      
       (3) When processing an IPv4 ICMP packet received by an IPv6 socket, set up
           the address correctly before searching for the affected peer.
      Signed-off-by: David Howells <dhowells@redhat.com>
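The address conversion in point (1) can be illustrated with plain socket types. This user-space sketch uses `struct sockaddr_in` and `struct sockaddr_in6` directly instead of `sockaddr_rxrpc`, and `map_v4_to_v6()` is a hypothetical helper.

```c
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Build an AF_INET6 address holding the IPv4-mapped form ::ffff:a.b.c.d. */
static struct sockaddr_in6 map_v4_to_v6(const struct sockaddr_in *v4)
{
	struct sockaddr_in6 v6;

	memset(&v6, 0, sizeof(v6));
	v6.sin6_family = AF_INET6;
	v6.sin6_port = v4->sin_port;   /* already in network byte order */
	v6.sin6_addr.s6_addr[10] = 0xff;
	v6.sin6_addr.s6_addr[11] = 0xff;
	memcpy(&v6.sin6_addr.s6_addr[12], &v4->sin_addr.s_addr, 4);
	return v6;
}
```

The length passed to the transport must then be `sizeof(struct sockaddr_in6)`, which is the mistake point (2) guards against when the size of the IPv4 address is used instead.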
  10. 05 Jun, 2017 1 commit
    • rxrpc: Add service upgrade support for client connections · 4e255721
      Committed by David Howells
      Make it possible for a client to use AuriStor's service upgrade facility.
      
      The client does this by adding an RXRPC_UPGRADE_SERVICE control message to
      the first sendmsg() of a call.  This takes no parameters.
      
      When recvmsg() starts returning data from the call, the service ID field in
      the returned msg_name will reflect the result of the upgrade attempt.  If
      the upgrade was ignored, srx_service will match what was set in the
      sendmsg(); if the upgrade happened the srx_service will be altered to
      indicate the service the server upgraded to.
      
      Note that:
      
       (1) The choice of upgrade service is up to the server.
      
       (2) Further client calls to the same server that would share a connection
           are blocked if an upgrade probe is in progress.
      
       (3) This should only be used to probe the service.  Clients should then
           use the returned service ID in all subsequent communications with that
           server (and not set the upgrade).  Note that the kernel will not
           retain this information should the connection expire from its cache.
      
       (4) If a server that supports upgrading is replaced by one that doesn't,
           whilst a connection is live, and if the replacement is running, say,
           OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement
           server will not respond to packets sent to the upgraded connection.
      
           At this point, calls will time out and the server must be reprobed.
      Signed-off-by: David Howells <dhowells@redhat.com>
  11. 06 Oct, 2016 3 commits
    • rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase · bf7d620a
      Committed by David Howells
      Don't request an ACK on the last DATA packet of a call's Tx phase as for a
      client there will be a reply packet or some sort of ACK to shift phase.  If
      the ACK is requested, OpenAFS sends a REQUESTED-ACK ACK with soft-ACKs in
      it and doesn't follow up with a hard-ACK.
      
      If we don't set the flag, OpenAFS will send a DELAY ACK that hard-ACKs the
      reply data, thereby allowing the call to terminate cleanly.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix loss of PING RESPONSE ACK production due to PING ACKs · a5af7e1f
      Committed by David Howells
      Separate the output of PING ACKs from the output of other sorts of ACK so
      that if we receive a PING ACK and schedule transmission of a PING RESPONSE
      ACK, the response doesn't get cancelled by a PING ACK we happen to be
      scheduling transmission of at the same time.
      
      If a PING RESPONSE gets lost, the other side might just sit there waiting
      for it and refuse to proceed otherwise.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix warning by splitting rxrpc_send_call_packet() · 26cb02aa
      Committed by David Howells
      Split rxrpc_send_call_packet() to separate ACK generation (which is more
      complicated) from ABORT generation.  This simplifies the code a bit and
      fixes the following warning:
      
      	In file included from ../net/rxrpc/output.c:20:0:
      	net/rxrpc/output.c: In function 'rxrpc_send_call_packet':
      	net/rxrpc/ar-internal.h:1187:27: error: 'top' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      	net/rxrpc/output.c:103:24: note: 'top' was declared here
      	net/rxrpc/output.c:225:25: error: 'hard_ack' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
  12. 30 Sep, 2016 2 commits
    • rxrpc: Request more ACKs in slow-start mode · b112a670
      Committed by David Howells
      Set the request-ACK on more DATA packets whilst we're in slow start mode so
      that we get sufficient ACKs back to supply information to configure the
      window.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Make Tx loss-injection go through normal return and adjust tracing · a1767077
      Committed by David Howells
      In rxrpc_send_data_packet() make the loss-injection path return through the
      same code as the transmission path so that the RTT determination is
      initiated and any future timer shuffling will be done, despite the packet
      having been binned.
      
      Whilst we're at it:
      
       (1) Add to the tx_data tracepoint an indication of whether or not we're
           retransmitting a data packet.
      
       (2) When we're deciding whether or not to request an ACK, rather than
           checking if we're in fast-retransmit mode check instead if we're
           retransmitting.
      
       (3) Don't invoke the lose_skb tracepoint when losing a Tx packet as we're
           not altering the sk_buff refcount nor are we just seeing it after
           getting it off the Tx list.
      
       (4) The rxrpc_skb_tx_lost note is then no longer used so remove it.
      
       (5) rxrpc_lose_skb() no longer needs to deal with rxrpc_skb_tx_lost.
      Signed-off-by: David Howells <dhowells@redhat.com>
  13. 25 Sep, 2016 2 commits
    • rxrpc: Implement slow-start · 57494343
      Committed by David Howells
      Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
      tracepoint is added to log the state of the congestion management algorithm
      and the decisions it makes.
      
      Notes:
      
       (1) Since we send fixed-size DATA packets (apart from the final packet in
           each phase), counters and calculations are in terms of packets rather
           than bytes.
      
       (2) The ACK packet carries the equivalent of TCP SACK.
      
       (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
           suited to SACK of a small number of packets.  It seems that, almost
           inevitably, by the time three 'duplicate' ACKs have been seen, we have
           narrowed the loss down to one or two missing packets, and the
           FLIGHT_SIZE calculation ends up as 2.
      
       (4) In rxrpc_resend(), if there was no data that apparently needed
           retransmission, we transmit a PING ACK to ask the peer to tell us what
           its Rx window state is.
      Signed-off-by: David Howells <dhowells@redhat.com>
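Note (1), counting in packets rather than bytes, can be made concrete with a heavily simplified window update. This is not the kernel's algorithm: `cong_state`, `on_new_ack()` and `on_loss()` are hypothetical, and the loss response below omits RFC 5681's fast-recovery window inflation.

```c
/* Congestion state counted in packets, not bytes. */
struct cong_state {
	unsigned int cwnd;        /* congestion window, in packets */
	unsigned int ssthresh;    /* slow-start threshold, in packets */
	unsigned int cumul_acks;  /* ACKs seen since cwnd last grew (avoidance) */
};

/* Apply one ACK that newly acknowledges data. */
static void on_new_ack(struct cong_state *c)
{
	if (c->cwnd < c->ssthresh) {
		c->cwnd++;                 /* slow start: +1 packet per ACK */
	} else if (++c->cumul_acks >= c->cwnd) {
		c->cwnd++;                 /* avoidance: +1 packet per window */
		c->cumul_acks = 0;
	}
}

/* Simplified loss response: halve from the flight size, floor of 2 packets. */
static void on_loss(struct cong_state *c, unsigned int flight_size)
{
	unsigned int half = flight_size / 2;

	c->ssthresh = half > 2 ? half : 2;
	c->cwnd = c->ssthresh;
	c->cumul_acks = 0;
}
```

The floor of 2 in `on_loss()` corresponds to the 2*SMSS floor in RFC 5681 once the unit is a packet, which is also why a tiny flight size, as described in note (3), leaves very little room to manoeuvre.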
    • rxrpc: Send an ACK after every few DATA packets we receive · 805b21b9
      Committed by David Howells
      Send an ACK if we haven't sent one for the last two packets we've received.
      This keeps the other end apprised of where we've got to - which is
      important if they're doing slow-start.
      
      We do this in recvmsg so that we can dispatch a packet directly without the
      need to wake up the background thread.
      
      This should possibly be made configurable in future.
      Signed-off-by: David Howells <dhowells@redhat.com>
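One way to read the "last two packets" rule is as a simple counter, sketched below. The names are hypothetical and the threshold of two is an interpretation of the text, not a value taken from the code.

```c
/* Hypothetical per-call count of DATA packets consumed since the last ACK. */
struct rx_ack_state {
	unsigned int unacked;
};

/*
 * Note one consumed DATA packet.  Returns nonzero when an ACK should be
 * sent, i.e. when two packets have gone by without one, and resets the
 * count in that case.
 */
static int note_packet_consumed(struct rx_ack_state *s)
{
	if (++s->unacked >= 2) {
		s->unacked = 0;
		return 1;
	}
	return 0;
}
```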
  14. 23 Sep, 2016 3 commits
  15. 22 Sep, 2016 4 commits
    • rxrpc: Reduce the number of ACK-Requests sent · 0d4b103c
      Committed by David Howells
      Reduce the number of ACK-Requests we set on DATA packets that we're sending
      to reduce network traffic.  We set the flag on odd-numbered DATA packets to
      start off the RTT cache until we have at least three entries in it and then
      probe once per second thereafter to keep it topped up.
      
      This could be made tunable in future.
      
      Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
      packets as we transmit them and not stored statically in the sk_buff.
      Signed-off-by: David Howells <dhowells@redhat.com>
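The flagging policy can be sketched as a per-packet decision made at transmission time. `rtt_probe_state` and `want_request_ack()` are hypothetical names, and milliseconds are used here purely for illustration.

```c
#include <stdint.h>

struct rtt_probe_state {
	unsigned int rtt_samples;    /* entries currently in the RTT cache */
	int64_t last_probe_at_ms;    /* when request-ACK was last set */
};

/* Decide whether to set the request-ACK flag on the DATA packet `seq`. */
static int want_request_ack(struct rtt_probe_state *s, uint32_t seq,
			    int64_t now_ms)
{
	int want;

	if (s->rtt_samples < 3)
		want = seq & 1;                               /* odd-numbered packets */
	else
		want = now_ms - s->last_probe_at_ms >= 1000;  /* once per second */

	if (want)
		s->last_probe_at_ms = now_ms;
	return want;
}
```

Because the decision depends on state at send time, the flag cannot be stored once in the buffered packet, which matches the note above about setting it as packets are transmitted.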
    • rxrpc: Obtain RTT data by requesting ACKs on DATA packets · 50235c4b
      Committed by David Howells
      In addition to sending a PING ACK to gain RTT data, we can set the
      RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK.  The
      ACK packet contains the serial number of the packet it is in response to,
      so we can look through the Tx buffer for a matching DATA packet.
      
      This requires that the data packets be stamped with the time of
      transmission as a ktime rather than having the resend_at time in jiffies.
      
      This further requires the resend code to do the resend determination in
      ktimes and convert to jiffies to set the timer.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Send pings to get RTT data · 8e83134d
      Committed by David Howells
      Send a PING ACK packet to the peer when we get a new incoming call from a
      peer we don't have a record for.  The PING RESPONSE ACK packet will tell us
      the following about the peer:
      
       (1) its receive window size
      
       (2) its MTU sizes
      
       (3) its support for jumbo DATA packets
      
       (4) if it supports slow start (similar to RFC 5681)
      
       (5) an estimate of the RTT
      
      This is necessary because the peer won't normally send us an ACK until it
      gets to the Rx phase and we send it a packet, but we would like to know
      some of this information before we start sending packets.
      
      A pair of tracepoints are added so that RTT determination can be observed.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs · 5a924b89
      Committed by David Howells
      Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
      but rather generate it on the fly and pass it to kernel_sendmsg() as a
      separate iov.  This reduces the amount of storage required.
      
      Note that the security header is still stored in the sk_buff as it may get
      encrypted along with the data (and doesn't change with each transmission).
      Signed-off-by: David Howells <dhowells@redhat.com>
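The separate-iov idea can be shown with the user-space `struct msghdr`, which is laid out for `sendmsg()` much as the kernel's is for `kernel_sendmsg()`. In this sketch `wire_header` mirrors the 28-byte Rx wire header, while `build_tx_msg()` is a hypothetical helper; byte-order conversion of the header fields is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Rx wire header layout (28 bytes); fields are big-endian on the wire. */
struct wire_header {
	uint32_t epoch, cid, call_number, seq, serial;
	uint8_t  type, flags, user_status, security_index;
	uint16_t cksum;
	uint16_t service_id;
};

/*
 * Describe one packet as two segments: a freshly generated header and the
 * stored payload.  Nothing is copied; the iovecs only point at the data.
 */
static void build_tx_msg(struct msghdr *msg, struct iovec iov[2],
			 struct wire_header *hdr,
			 void *payload, size_t payload_len)
{
	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = payload;
	iov[1].iov_len = payload_len;

	memset(msg, 0, sizeof(*msg));
	msg->msg_iov = iov;
	msg->msg_iovlen = 2;
}
```

Since the serial number and flags change on each (re)transmission, regenerating the header per send avoids rewriting the buffered packet.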
  16. 17 Sep, 2016 6 commits
  17. 14 Sep, 2016 3 commits
    • rxrpc: Add IPv6 support · 75b54cb5
      Committed by David Howells
      Add IPv6 support to AF_RXRPC.  With this, AF_RXRPC sockets can be created:
      
      	service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);
      
      instead of:
      
      	service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
      
      The AFS filesystem doesn't support IPv6 at the moment, though, since that
      requires upgrades to some of the RPC calls.
      
      Note that a good portion of this patch is replacing "%pI4:%u" in print
      statements with "%pISpc" which is able to handle both protocols and print
      the port.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually · 1c2bc7b9
      Committed by David Howells
      There are two places that want to transmit a packet in response to one just
      received and manually pick the address to reply to out of the sk_buff.
      Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled
      automatically.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Correctly initialise, limit and transmit call->rx_winsize · 75e42126
      Committed by David Howells
      call->rx_winsize should be initialised to the sysctl setting and the sysctl
      setting should be limited to the maximum we want to permit.  Further, we
      need to place this in the ACK info instead of the sysctl setting.
      
      Furthermore, discard the idea of accepting the subpackets of a jumbo packet
      that lie beyond the receive window when the first packet of the jumbo is
      within the window.  Just discard the excess subpackets instead.  This
      allows the receive window to be opened up right to the buffer size less one
      for the dead slot.
      Signed-off-by: David Howells <dhowells@redhat.com>
  18. 08 Sep, 2016 1 commit
    • rxrpc: Rewrite the data and ack handling code · 248f219c
      Committed by David Howells
      Rewrite the data and ack handling code such that:
      
       (1) Parsing of received ACK and ABORT packets and the distribution and the
           filing of DATA packets happens entirely within the data_ready context
           called from the UDP socket.  This allows us to process and discard ACK
           and ABORT packets much more quickly (they're no longer stashed on a
           queue for a background thread to process).
      
       (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim().  We instead
           keep track of the offset and length of the content of each packet in
           the sk_buff metadata.  This means we don't do any allocation in the
           receive path.
      
       (3) Jumbo DATA packet parsing is now done in data_ready context.  Rather
           than cloning the packet once for each subpacket and pulling/trimming
           it, we file the packet multiple times with an annotation for each
           indicating which subpacket is there.  From that we can directly
           calculate the offset and length.
      
       (4) A call's receive queue can be accessed without taking locks (memory
           barriers do have to be used, though).
      
       (5) Incoming calls are set up from preallocated resources and immediately
           made live.  They can then have packets queued upon them and ACKs
           generated.  If insufficient resources exist, DATA packet #1 is given a
           BUSY reply and other DATA packets are discarded.
      
       (6) sk_buffs no longer take a ref on their parent call.
      
      To make this work, the following changes are made:
      
       (1) Each call's receive buffer is now a circular buffer of sk_buff
           pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
           between the call and the socket.  This permits each sk_buff to be in
           the buffer multiple times.  The receive buffer is reused for the
           transmit buffer.
      
       (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
           to the data buffer.  Transmission phase annotations indicate whether a
           buffered packet has been ACK'd or not and whether it needs
           retransmission.
      
           Receive phase annotations indicate whether a slot holds a whole packet
           or a jumbo subpacket and, if the latter, which subpacket.  They also
           note whether the packet has been decrypted in place.
      
       (3) DATA packet window tracking is much simplified.  Each phase has just
           two numbers representing the window (rx_hard_ack/rx_top and
           tx_hard_ack/tx_top).
      
           The hard_ack number is the sequence number before the base of the window,
           representing the last packet the other side says it has consumed.
           hard_ack starts from 0 and the first packet is sequence number 1.
      
           The top number is the sequence number of the highest-numbered packet
           residing in the buffer.  Packets between hard_ack+1 and top are
           soft-ACK'd to indicate they've been received, but not yet consumed.
      
           Four macros, before(), before_eq(), after() and after_eq() are added
           to compare sequence numbers within the window.  This allows for the
           top of the window to wrap when the hard-ack sequence number gets close
           to the limit.
      
           Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
           to indicate when rx_top and tx_top point at the packets with the
           LAST_PACKET bit set, indicating the end of the phase.
      
       (4) Calls are queued on the socket 'receive queue' rather than packets.
           This means that we don't have to invent dummy packets to queue to
           indicate abnormal/terminal states and we don't have to keep metadata
           packets (such as ABORTs) around.
      
       (5) The offset and length of a (sub)packet's content are now passed to
           the verify_packet security op.  This is currently expected to decrypt
           the packet in place and validate it.
      
           However, there's now nowhere to store the revised offset and length of
           the actual data within the decrypted blob (there may be a header and
           padding to skip) because an sk_buff may represent multiple packets, so
           a locate_data security op is added to retrieve these details from the
           sk_buff content when needed.
      
       (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
           individually secured and needs to be individually decrypted.  The code
           to do this is broken out into rxrpc_recvmsg_data() and shared with the
           kernel API.  It now iterates over the call's receive buffer rather
           than walking the socket receive queue.
      
      Additional changes:
      
       (1) The timers are condensed to a single timer that is set for the soonest
           of three timeouts (delayed ACK generation, DATA retransmission and
           call lifespan).
      
       (2) Transmission of ACK and ABORT packets is effected immediately from
           process-context socket ops/kernel API calls that cause them instead of
           them being punted off to a background work item.  The data_ready
           handler still has to defer to the background, though.
      
       (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
           filesystem can shut down the socket and flush its own work items
           before closing the socket to deal with any in-progress service calls.
      
      Future additional changes that will need to be considered:
      
       (1) Make sure that a call doesn't hog the front of the queue by receiving
           data from the network as fast as userspace is consuming it to the
           exclusion of other calls.
      
       (2) Transmit delayed ACKs from within recvmsg() when we've consumed
           sufficiently more packets to avoid the background work item needing to
           run.

      Signed-off-by: David Howells <dhowells@redhat.com>
      248f219c
  19. 07 Sep, 2016 1 commit
    • D
      rxrpc: Calls shouldn't hold socket refs · 8d94aa38
      Committed by David Howells
      rxrpc calls shouldn't hold refs on the sock struct.  This was done so that
      the socket wouldn't go away whilst the call was in progress, such that the
      call could reach the socket's queues.
      
      However, we can mark the socket as requiring an RCU release and rely on the
      RCU read lock.
      
      To make this work, we do:
      
       (1) rxrpc_release_call() removes the call's call user ID.  This is now
           only called from socket operations and not from the call processor:
      
      	rxrpc_accept_call() / rxrpc_kernel_accept_call()
      	rxrpc_reject_call() / rxrpc_kernel_reject_call()
      	rxrpc_kernel_end_call()
      	rxrpc_release_calls_on_socket()
      	rxrpc_recvmsg()
      
           Though it is also called in the cleanup path of
           rxrpc_accept_incoming_call() before we assign a user ID.
      
       (2) Pass the socket pointer into rxrpc_release_call() rather than getting
           it from the call so that we can get rid of uninitialised calls.
      
       (3) Fix call processor queueing to pass a ref to the work queue and to
           release that ref at the end of the processor function (or to pass it
           back to the work queue if we have to requeue).
      
       (4) Exit the call processor function as soon as possible if the call is
           complete, and don't requeue the call in that case.
      
       (5) Clean up the call immediately that the refcount reaches 0 rather than
           trying to defer it.  Actual deallocation is deferred to RCU, however.
      
       (6) Don't hold socket refs for allocated calls.
      
       (7) Use the RCU read lock when queueing a message on a socket and treat
           the call's socket pointer according to RCU rules and check it for
           NULL.
      
           We also need to use the RCU read lock when viewing a call through
           procfs.
      
       (8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call()
           if this hasn't been done yet so that we can then disconnect the call.
           Once the call is disconnected, it won't have any access to the
           connection struct and the UDP socket for the call work processor to be
           able to send the ACK.  Terminal retransmission will be handled by the
           connection processor.
      
       (9) Release all calls immediately on the closing of a socket rather than
           trying to defer this.  Incomplete calls will be aborted.
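
      Item (5) above (clean up as soon as the refcount reaches 0, defer only
      the deallocation) can be sketched in userspace C as follows.  The
      struct, the flags and the function names are hypothetical stand-ins;
      the free_deferred flag merely marks the point where the real code
      would hand the object to RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a call object. */
struct call_sketch {
	atomic_int usage;	/* reference count */
	bool cleaned_up;	/* set once teardown has run */
	bool free_deferred;	/* set once freeing has been handed off */
};

static void call_init(struct call_sketch *call)
{
	atomic_init(&call->usage, 1);	/* the creator holds the first ref */
	call->cleaned_up = false;
	call->free_deferred = false;
}

static void call_get(struct call_sketch *call)
{
	atomic_fetch_add(&call->usage, 1);
}

/* Drop a ref; returns true if this was the last one. */
static bool call_put(struct call_sketch *call)
{
	int old = atomic_fetch_sub(&call->usage, 1);

	assert(old > 0);
	if (old != 1)
		return false;

	call->cleaned_up = true;	/* tear down immediately at zero */
	call->free_deferred = true;	/* placeholder for a deferred free */
	return true;
}
```

      Only the caller that drops the last ref sees true and runs the
      teardown, so no separate destroyer work item is needed in this model.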
      
      The call refcount model is much simplified.  Refs are held on the call by:
      
       (1) A socket's user ID tree.
      
       (2) A socket's incoming call secureq and acceptq.
      
       (3) A kernel service that has a call in progress.
      
       (4) A queued call work processor.  We have to take care to put any call
           that we failed to queue.
      
       (5) sk_buffs on a socket's receive queue.  A future patch will get rid of
           this.
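
      The rule in item (4) above, that a queued work processor owns a ref and
      that a failed queueing attempt must give its ref back, might look
      roughly like this.  The queue primitive, the struct and all names are
      assumptions for illustration, and the count is non-atomic to keep the
      sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct queued_call {
	int usage;	/* reference count (non-atomic in this sketch) */
	bool pending;	/* true while on the hypothetical work queue */
};

/* Hypothetical queue primitive: fails if the item is already queued. */
static bool try_queue_work(struct queued_call *call)
{
	if (call->pending)
		return false;
	call->pending = true;
	return true;
}

/* Take a ref for the work queue; put it again if queueing failed. */
static void queue_call(struct queued_call *call)
{
	assert(call != NULL && call->usage > 0);
	call->usage++;
	if (!try_queue_work(call))
		call->usage--;	/* already queued, which holds its own ref */
}

/* The processor runs, then releases the ref it was handed. */
static void run_processor(struct queued_call *call)
{
	assert(call != NULL && call->pending);
	call->pending = false;
	/* ... event handling would go here ... */
	call->usage--;
}
```

      Queueing the same call twice therefore leaves the count unchanged the
      second time, and the count returns to its previous value once the
      processor has run.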
      
      Whilst we're at it, we can do:
      
       (1) Get rid of the RXRPC_CALL_EV_RELEASE event.  Release is now done
           entirely from the socket routines and never from the call's processor.
      
       (2) Get rid of the RXRPC_CALL_DEAD state.  Calls now end in the
           RXRPC_CALL_COMPLETE state.
      
       (3) Get rid of the rxrpc_call::destroyer work item.  Calls are now torn
           down when their refcount reaches 0 and then handed over to RCU for
           final cleanup.
      
       (4) Get rid of the rxrpc_call::deadspan timer.  Calls are cleaned up
           immediately they're finished with and don't hang around.
           Post-completion retransmission is handled by the connection processor
           once the call is disconnected.
      
       (5) Get rid of the dead call expiry setting as there's no longer a timer
           to set.
      
       (6) rxrpc_destroy_all_calls() can just check that the call list is empty.

      Signed-off-by: David Howells <dhowells@redhat.com>
      8d94aa38
  20. 05 Sep, 2016 1 commit