  1. 16 Oct, 2018 2 commits
  2. 09 Oct, 2018 3 commits
    • rxrpc: Fix the packet reception routine · c1e15b49
      Committed by David Howells
      The rxrpc_input_packet() function and its call tree were built around the
      assumption that the data_ready() handler, which UDP calls to tell a kernel
      service that data is available, is non-reentrant.  This meant that
      certain locking could be dispensed with.
      
      This, however, turns out not to be the case with a multi-queue network card
      that can deliver packets to multiple cpus simultaneously.  Each of those
      cpus can be in the rxrpc_input_packet() function at the same time.
      
      Fix by adding or changing some structure members:
      
       (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
      
       (2) Make conn->service_id into a 32-bit variable so that it can be
           cmpxchg'd on all arches.
      
       (3) Add call->input_lock to serialise access to the Rx/Tx state.  Note
           that although the Rx and Tx states are (almost) entirely separate,
           there's no point completing the separation and having separate locks
           since it's a bi-phasal RPC protocol rather than a bidirectional
           streaming protocol.  Data transmission and data reception do not take
           place simultaneously on any particular call.
      
      and making the following functional changes:
      
       (1) In rxrpc_input_data(), hold call->input_lock around the core to
           prevent simultaneous producing of packets into the Rx ring and
           updating of tracking state for a particular call.
      
       (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
           check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
           The bit test and bit clear can then be combined.  No further locking
           is needed here.
      
       (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
           the ACK packet.  The superseded ACK check is then done both before and
           after the lock is taken.
      
           The handling of ackinfo data is split, parsing before the lock is taken
           and processing with it held.  This is keyed on rxMTU being non-zero.
      
           Congestion management is also done within the locked section.
      
       (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
           rotation.  The ACKALL packet carries no information and is only really
           useful after all packets have been transmitted since it's imprecise.
      
       (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
           prevent calls being simultaneously implicitly ended on two cpus and
           also to prevent any races with incoming call setup.
      
       (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
           on a connection.  It is only permitted to happen once for a
           connection.
      
       (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
           rx->incoming_lock to see if someone else set up the call, connection
           or peer whilst we were getting there.  We can't trust the values from
           the earlier routing check unless we pin refs on them - which we want
           to avoid.
      
           Further, we need to allow for an incoming call to have its state
           changed on another CPU between us making it live and us adjusting it
           because the conn is now in the RXRPC_CONN_SERVICE state.
      
       (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
           to the RTT buffer.  Don't need to lock around setting peer->rtt.
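The once-only service upgrade in item (6) can be illustrated in userspace with a C11 compare-exchange. This is a minimal sketch, not the kernel code: `struct conn_sketch` and `conn_upgrade_service()` are hypothetical names, and C11 `atomic_compare_exchange_strong()` stands in for the kernel's `cmpxchg()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the connection record. */
struct conn_sketch {
    _Atomic uint32_t service_id;    /* 32 bits wide so it can be compare-exchanged */
};

/*
 * Try to move the connection from old_id to new_id.  Returns true if the
 * connection ends up on new_id, whether this caller or a concurrent one
 * performed the swap, and false if it is on some other service.
 */
static bool conn_upgrade_service(struct conn_sketch *conn,
                                 uint32_t old_id, uint32_t new_id)
{
    uint32_t expected = old_id;

    if (atomic_compare_exchange_strong(&conn->service_id, &expected, new_id))
        return true;                /* we performed the one permitted upgrade */

    /* The swap failed; 'expected' now holds the value that was found. */
    return expected == new_id;      /* another CPU already did the same upgrade */
}
```

Two CPUs racing to upgrade the same connection both see success, while an attempt to upgrade a second time to a different service ID is refused because the stored value no longer matches `old_id`.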
      
      For reference, the inventory of state-accessing or state-altering functions
      used by the packet input procedure is:
      
      > rxrpc_input_packet()
        * PACKET CHECKING
      
        * ROUTING
          > rxrpc_post_packet_to_local()
          > rxrpc_find_connection_rcu() - uses RCU
            > rxrpc_lookup_peer_rcu() - uses RCU
            > rxrpc_find_service_conn_rcu() - uses RCU
            > idr_find() - uses RCU
      
        * CONNECTION-LEVEL PROCESSING
          - Service upgrade
            - Can only happen once per conn
            ! Changed to use cmpxchg
          > rxrpc_post_packet_to_conn()
          - Setting conn->hi_serial
            - Probably safe not using locks
            - Maybe use cmpxchg
      
        * CALL-LEVEL PROCESSING
          > Old-call checking
            > rxrpc_input_implicit_end_call()
              > rxrpc_call_completed()
              > rxrpc_queue_call()
              ! Need to take rx->incoming_lock
              > __rxrpc_disconnect_call()
              > rxrpc_notify_socket()
          > rxrpc_new_incoming_call()
            - Uses rx->incoming_lock for the entire process
              - Might be able to drop this earlier in favour of the call lock
            > rxrpc_incoming_call()
              ! Conflicts with rxrpc_input_implicit_end_call()
          > rxrpc_send_ping()
            - Don't need locks to check rtt state
            > rxrpc_propose_ACK()

        * PACKET DISTRIBUTION
          > rxrpc_input_call_packet()
            > rxrpc_input_data()
              * QUEUE DATA PACKET ON CALL
              > rxrpc_reduce_call_timer()
                - Uses timer_reduce()
              ! Needs call->input_lock
              > rxrpc_receiving_reply()
                ! Needs locking around ack state
                > rxrpc_rotate_tx_window()
                > rxrpc_end_tx_phase()
              > rxrpc_proto_abort()
              > rxrpc_input_dup_data()
              - Fills the Rx buffer
              - rxrpc_propose_ACK()
              - rxrpc_notify_socket()

            > rxrpc_input_ack()
              * APPLY ACK PACKET TO CALL AND DISCARD PACKET
              > rxrpc_input_ping_response()
                - Probably doesn't need any extra locking
                ! Need READ_ONCE() on call->ping_serial
                > rxrpc_input_check_for_lost_ack()
                  - Takes call->lock to consult Tx buffer
                > rxrpc_peer_add_rtt()
                  ! Needs to take a lock (peer->rtt_input_lock)
                  ! Could perhaps manage with cmpxchg() and xadd() instead
              > rxrpc_input_requested_ack()
                - Consults Tx buffer
                  ! Probably needs a lock
                > rxrpc_peer_add_rtt()
              > rxrpc_propose_ACK()
              > rxrpc_input_ackinfo()
                - Changes call->tx_winsize
                  ! Use cmpxchg to handle change
                  ! Should perhaps track serial number
                - Uses peer->lock to record MTU specification changes
              > rxrpc_proto_abort()
              ! Need to take call->input_lock
              > rxrpc_rotate_tx_window()
              > rxrpc_end_tx_phase()
              > rxrpc_input_soft_acks()
              - Consults the Tx buffer
              > rxrpc_congestion_management()
                - Modifies the Tx annotations
                ! Needs call->input_lock
                > rxrpc_queue_call()

            > rxrpc_input_abort()
              * APPLY ABORT PACKET TO CALL AND DISCARD PACKET
              > rxrpc_set_call_completion()
              > rxrpc_notify_socket()

            > rxrpc_input_ackall()
              * APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
              ! Need to take call->input_lock
              > rxrpc_rotate_tx_window()
              > rxrpc_end_tx_phase()
      
          > rxrpc_reject_packet()
      
      There are some functions used by the above that queue the packet, after
      which the procedure is terminated:
      
       - rxrpc_post_packet_to_local()
         - local->event_queue is an sk_buff_head
         - local->processor is a work_struct
       - rxrpc_post_packet_to_conn()
         - conn->rx_queue is an sk_buff_head
         - conn->processor is a work_struct
       - rxrpc_reject_packet()
         - local->reject_queue is an sk_buff_head
         - local->processor is a work_struct
      
      And some that offload processing to process context:
      
       - rxrpc_notify_socket()
         - Uses RCU lock
         - Uses call->notify_lock to call call->notify_rx
         - Uses call->recvmsg_lock to queue recvmsg side
       - rxrpc_queue_call()
         - call->processor is a work_struct
       - rxrpc_propose_ACK()
         - Uses call->lock to wrap __rxrpc_propose_ACK()
      
      And a bunch that complete a call, all of which use call->state_lock to
      protect the call state:
      
       - rxrpc_call_completed()
       - rxrpc_set_call_completion()
       - rxrpc_abort_call()
       - rxrpc_proto_abort()
         - Also uses rxrpc_queue_call()
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix connection-level abort handling · 64753092
      Committed by David Howells
      Fix connection-level abort handling to cache the abort and error codes
      properly so that a new incoming call can be properly aborted if it races
      with the parent connection being aborted by another CPU.
      
      The abort_code and error parameters can then be dropped from
      rxrpc_abort_calls().
      
      Fixes: f5c17aae ("rxrpc: Calls should only have one terminal state")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Only take the rwind and mtu values from latest ACK · 298bc15b
      Committed by David Howells
      Move the out-of-order and duplicate ACK packet check to before the call to
      rxrpc_input_ackinfo() so that the receive window size and MTU size are only
      checked in the latest ACK packet and don't regress.
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: David Howells <dhowells@redhat.com>
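The ordering described in this commit, rejecting stale ACKs before looking at their window and MTU values, might be sketched as follows. The struct and function names are hypothetical, and the wrap-safe serial comparison is an assumption modelled on the usual signed-difference idiom for 32-bit sequence numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down per-call ACK state. */
struct call_ack_state {
    uint32_t acks_latest;   /* serial of the newest ACK applied so far */
    uint32_t rwind;         /* peer's receive window taken from that ACK */
    uint32_t mtu;           /* peer's MTU taken from that ACK */
};

/* Wrap-safe test for "serial a is at or before serial b". */
static bool serial_before_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) <= 0;
}

/* Returns true if the ACK was applied, false if it was stale or a duplicate. */
static bool apply_ack_info(struct call_ack_state *call, uint32_t serial,
                           uint32_t rwind, uint32_t mtu)
{
    /* Check first, so an old ACK can never regress rwind or mtu. */
    if (serial_before_eq(serial, call->acks_latest))
        return false;

    call->acks_latest = serial;
    call->rwind = rwind;
    call->mtu = mtu;
    return true;
}
```

Because the staleness check runs before the values are read, a delayed ACK carrying a smaller window arrives too late to overwrite the values from a newer one.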
  3. 08 Oct, 2018 4 commits
    • rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window() · dfe99522
      Committed by David Howells
      Carry the call state out of the locked section in rxrpc_rotate_tx_window()
      rather than sampling it afterwards.  This is only used to select tracepoint
      data, but could have changed by the time we do the tracepoint.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window() · c479d5f2
      Committed by David Howells
      We should only call the function to end a call's Tx phase if we rotated the
      marked-last packet out of the transmission buffer.
      
      Make rxrpc_rotate_tx_window() return an indication of whether it just
      rotated the packet marked as the last out of the transmit buffer, carrying
      the information out of the locked section in that function.
      
      We can then check the return value instead of examining RXRPC_CALL_TX_LAST.
      
      Fixes: 70790dbe ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
      Signed-off-by: David Howells <dhowells@redhat.com>
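The idea of carrying a result out of the locked section, rather than re-reading shared state afterwards, can be sketched in userspace like this. It is an illustration under assumptions: the ring layout, `TX_ANNO_LAST` and the `atomic_flag` spinlock are hypothetical stand-ins, not the kernel's structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE 8          /* illustrative; must be a power of two */
#define TX_ANNO_LAST 0x01       /* hypothetical "last packet" annotation bit */

struct tx_sketch {
    atomic_flag lock;           /* stand-in for the kernel spinlock */
    uint32_t hard_ack;          /* highest sequence number rotated out */
    uint8_t annotations[TX_RING_SIZE];
};

static void tx_lock(struct tx_sketch *tx)
{
    while (atomic_flag_test_and_set_explicit(&tx->lock, memory_order_acquire))
        ;                       /* spin until the flag is released */
}

static void tx_unlock(struct tx_sketch *tx)
{
    atomic_flag_clear_explicit(&tx->lock, memory_order_release);
}

/*
 * Rotate packets up to and including sequence number 'to' out of the ring.
 * Returns true if the packet annotated as last was among them.
 */
static bool rotate_tx_window(struct tx_sketch *tx, uint32_t to)
{
    bool rot_last = false;

    tx_lock(tx);
    while ((int32_t)(tx->hard_ack - to) < 0) {
        uint32_t ix;

        tx->hard_ack++;
        ix = tx->hard_ack & (TX_RING_SIZE - 1);
        if (tx->annotations[ix] & TX_ANNO_LAST)
            rot_last = true;
        tx->annotations[ix] = 0;
    }
    tx_unlock(tx);

    return rot_last;            /* sampled under the lock, used outside it */
}
```

A caller would end the Tx phase only when the return value is true, instead of testing a flag that another CPU might have changed after the lock was dropped.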
    • rxrpc: Don't need to take the RCU read lock in the packet receiver · bfd28211
      Committed by David Howells
      We don't need to take the RCU read lock in the rxrpc packet receive
      function because it's held further up the stack in the IP input routine
      around the UDP receive routines.
      
      Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
      This simplifies the code.
      
      Fixes: 70790dbe ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Use the UDP encap_rcv hook · 5271953c
      Committed by David Howells
      Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
      in which a packet is placed onto the UDP receive queue and then immediately
      removed again by rxrpc.  Going via the queue in this manner seems like it
      should be unnecessary.
      
      This does, however, require the invention of a value to place in encap_type
      as that's one of the conditions to switch packets out to the encap_rcv
      hook.  Possibly the value doesn't actually matter for anything other than
      sockopts on the UDP socket, which aren't accessible outside of rxrpc
      anyway.
      
      This seems to cut a bit of time out of the time elapsed between each
      sk_buff being timestamped and turning up in rxrpc (the final number in the
      following trace excerpts).  I measured this by making the rxrpc_rx_packet
      trace point print the time elapsed between the skb being timestamped and
      the current time (in ns), e.g.:
      
      	... 424.278721: rxrpc_rx_packet: ...  ACK 25026
      
      So doing a 512MiB DIO read from my test server, with an unmodified kernel:
      
              N       min     max     sum             mean    stddev
              27605   2626    7581    7.83992e+07     2840.04 181.029
      
      and with the patch applied:
      
              N       min     max     sum             mean    stddev
              27547   1895    12165   6.77461e+07     2459.29 255.02

      Signed-off-by: David Howells <dhowells@redhat.com>
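The figures in the two tables can be cross-checked with a little arithmetic. The helpers below are hypothetical names introduced only for this check; the inputs are the N, sum and mean values quoted in the commit message.

```c
#include <assert.h>

/* Mean latency in ns from the sum and sample count of one table row. */
static double mean_ns(double sum, double n)
{
    return sum / n;
}

/* Relative reduction of the mean, as a percentage of the "before" value. */
static double percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With the quoted values, `mean_ns(7.83992e+07, 27605)` and `mean_ns(6.77461e+07, 27547)` reproduce the listed means to within rounding, and `percent_reduction(2840.04, 2459.29)` comes to roughly 13.4%, a drop of about 381 ns per packet. The same tables also show the maximum and the standard deviation rising with the patch applied.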
  4. 05 Oct, 2018 2 commits
    • rxrpc: Fix the data_ready handler · 2cfa2271
      Committed by David Howells
      Fix the rxrpc_data_ready() function to pick up all packets and to not miss
      any.  There are two problems:
      
       (1) The sk_data_ready pointer on the UDP socket is set *after* it is
           bound.  This means that it's open for business before we're ready to
           dequeue packets, so a tiny window exists in which a packet can
           sneak onto the receive queue without us ever knowing about it.
      
           Fix this by setting the pointers on the socket prior to binding it.
      
       (2) skb_recv_udp() will return an error (such as ENETUNREACH) if there was
           an error on the transmission side, even though we set the
           sk_error_report hook.  Because rxrpc_data_ready() returns immediately
           in such a case, it never actually removes its packet from the receive
           queue.
      
           Fix this by abstracting out the UDP dequeuing and checksumming into a
           separate function that keeps hammering on skb_recv_udp() until it
           returns -EAGAIN, passing the packets extracted to the remainder of the
           function.
      
      and two potential problems:
      
       (3) It might be possible in some circumstances or in the future for
           packets to be added to the UDP receive queue whilst rxrpc is
           busy consuming them, so the data_ready() handler might get called
           less often than once per packet.

           Allow for this by fully draining the queue on each call, as in (2).
      
       (4) If a packet fails the checksum check, the code currently returns after
           discarding the packet without checking for more.
      
           Allow for this by fully draining the queue on each call, as in (2).
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
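The "keep reading until -EAGAIN" loop from fix (2) can be sketched against a simulated queue. Everything here is a stand-in: `fake_recv()` only mimics the shape of a non-blocking receive and is not `skb_recv_udp()`, and the entry layout is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical queue entry: a packet (err == 0) or a queued error (err < 0). */
struct fake_entry {
    int err;
    int csum_ok;
};

struct fake_queue {
    const struct fake_entry *entries;
    size_t count;
    size_t next;
};

/* Returns 0 and a packet, a queued error, or -EAGAIN once the queue is empty. */
static int fake_recv(struct fake_queue *q, const struct fake_entry **out)
{
    const struct fake_entry *e;

    if (q->next == q->count)
        return -EAGAIN;
    e = &q->entries[q->next++];
    if (e->err)
        return e->err;
    *out = e;
    return 0;
}

/* Drain the whole queue; returns the number of packets handed on. */
static int drain_queue(struct fake_queue *q)
{
    int delivered = 0;

    for (;;) {
        const struct fake_entry *pkt = NULL;
        int ret = fake_recv(q, &pkt);

        if (ret == -EAGAIN)
            break;          /* queue is empty: only now do we stop */
        if (ret < 0)
            continue;       /* reported error: keep going */
        if (!pkt->csum_ok)
            continue;       /* bad checksum: discard, keep going */
        delivered++;
    }
    return delivered;
}
```

The loop exits only on -EAGAIN, so neither a reported error nor a packet that fails its checksum can leave later packets stranded on the queue.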
    • rxrpc: Fix some missed refs to init_net · 5e33a23b
      Committed by David Howells
      Fix some refs to init_net that should've been changed to the appropriate
      network namespace.
      
      Fixes: 2baec2c3 ("rxrpc: Support network namespacing")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
  5. 28 Sep, 2018 7 commits
    • rxrpc: Fix error distribution · f3344303
      Committed by David Howells
      Fix error distribution by immediately delivering the errors to all the
      affected calls rather than deferring them to a worker thread.  The problem
      with the latter is that retries and other activity can happen in the
      meantime, when we'd rather stop the affected calls sooner.
      
      To this end:
      
       (1) Stop the error distributor from removing calls from the error_targets
           list so that peer->lock isn't needed to synchronise against other adds
           and removals.
      
       (2) Require the peer's error_targets list to be accessed with RCU, thereby
           avoiding the need to take peer->lock over distribution.
      
       (3) Don't attempt to affect a call's state if it is already marked complete.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix transport sockopts to get IPv4 errors on an IPv6 socket · 37a675e7
      Committed by David Howells
      It seems that enabling IPV6_RECVERR on an IPv6 socket doesn't also turn on
      IP_RECVERR, so neither local errors nor ICMP-transported remote errors from
      IPv4 peer addresses are returned to the AF_RXRPC protocol.
      
      Make the sockopt setting code in rxrpc_open_socket() fall through from the
      AF_INET6 case to the AF_INET case to turn on all the AF_INET options too in
      the AF_INET6 case.
      
      Fixes: f2aeed3a ("rxrpc: Fix error reception on AF_INET6 sockets")
      Signed-off-by: David Howells <dhowells@redhat.com>
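The fall-through arrangement described above can be shown with a small decision function. This sketch only computes which options would be enabled; the `OPT_*` flags and the function name are hypothetical, and no real `setsockopt()` calls are made.

```c
#include <assert.h>
#include <sys/socket.h>

#define OPT_IPV6_RECVERR 0x1    /* hypothetical flag standing for IPV6_RECVERR */
#define OPT_IP_RECVERR   0x2    /* hypothetical flag standing for IP_RECVERR */

/* Returns the set of error-reporting options to enable for a socket family. */
static unsigned int recverr_opts_for_family(int family)
{
    unsigned int opts = 0;

    switch (family) {
    case AF_INET6:
        opts |= OPT_IPV6_RECVERR;
        /* Fall through - IPv4 peers can be reached from an IPv6 socket. */
    case AF_INET:
        opts |= OPT_IP_RECVERR;
        break;
    default:
        break;
    }
    return opts;
}
```

An AF_INET6 socket ends up with both flags set, while an AF_INET socket gets only the IPv4 one.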
    • rxrpc: Make service call handling more robust · 0099dc58
      Committed by David Howells
      Make the following changes to improve the robustness of the code that sets
      up a new service call:
      
       (1) Cache the rxrpc_sock struct obtained in rxrpc_data_ready() to do a
           service ID check and pass that along to rxrpc_new_incoming_call().
           This means that I can remove the check from rxrpc_new_incoming_call()
           without the need to worry about the socket attached to the local
           endpoint getting replaced - which would invalidate the check.
      
       (2) Cache the rxrpc_peer struct, thereby allowing the peer search to be
           done once.  The peer is passed to rxrpc_new_incoming_call(), thereby
           saving the need to repeat the search.
      
           This also reduces the possibility of rxrpc_publish_service_conn()
           BUG()'ing due to the detection of a duplicate connection, despite the
           initial search done by rxrpc_find_connection_rcu() having turned up
           nothing.
      
           This BUG() shouldn't ever get hit since rxrpc_data_ready() *should* be
           non-reentrant and the result of the initial search should still hold
           true, but it has proven possible to hit.
      
           I *think* this may be due to __rxrpc_lookup_peer_rcu() cutting short
           the iteration over the hash table if it finds a matching peer with a
           zero usage count, but I don't know for sure since it's only ever been
           hit once that I know of.
      
           Another possibility is that a bug in rxrpc_data_ready() that checked
           the wrong byte in the header for the RXRPC_CLIENT_INITIATED flag
           might've let through a packet that caused a spurious and invalid call
           to be set up.  That is addressed in another patch.
      
       (3) Fix __rxrpc_lookup_peer_rcu() to skip peer records that have a zero
           usage count rather than stopping and returning not found, just in case
           there's another peer record behind it in the bucket.
      
       (4) Don't search the peer records in rxrpc_alloc_incoming_call(), but
           rather either use the peer cached in (2) or, if one wasn't found,
           preemptively install a new one.
      
      Fixes: 8496af50 ("rxrpc: Use RCU to access a peer's service connection tree")
      Signed-off-by: David Howells <dhowells@redhat.com>
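Item (3), skipping records with a zero usage count instead of giving up on the bucket, can be sketched with a plain linked list. The struct, its fields and `lookup_peer()` are hypothetical simplifications; the kernel uses an RCU-protected hash table and an atomic usage counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down peer record in one hash bucket. */
struct peer_sketch {
    unsigned int addr;          /* lookup key */
    int usage;                  /* 0 means the record is being torn down */
    struct peer_sketch *next;   /* next record in the same bucket */
};

/* Find a live peer matching addr, or NULL if the bucket holds none. */
static struct peer_sketch *lookup_peer(struct peer_sketch *bucket,
                                       unsigned int addr)
{
    struct peer_sketch *p;

    for (p = bucket; p; p = p->next) {
        if (p->addr != addr)
            continue;
        if (p->usage == 0)
            continue;   /* dying record: keep looking, don't stop here */
        return p;
    }
    return NULL;
}
```

Had the loop returned NULL on the first dead match, a live replacement record further along the bucket would never be found, and the caller could go on to create a duplicate.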
    • rxrpc: Improve up-front incoming packet checking · 403fc2a1
      Committed by David Howells
      Do more up-front checking on incoming packets to weed out invalid ones and
      also ones aimed at services that we don't support.
      
      Whilst we're at it, replace the clearing of call and skew if we don't find
      a connection with just initialising the variables to zero at the top of the
      function.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Emit BUSY packets when supposed to rather than ABORTs · ece64fec
      Committed by David Howells
      In the input path, a received sk_buff can be marked for rejection by
      setting RXRPC_SKB_MARK_* in skb->mark and, if needed, some auxiliary data
      (such as an abort code) in skb->priority.  The rejection is handled by
      queueing the sk_buff up for dealing with in process context.  The output
      code reads the mark and priority and, theoretically, generates an
      appropriate response packet.
      
      However, if RXRPC_SKB_MARK_BUSY is set, this isn't noticed and an ABORT
      message with a random abort code is generated (since skb->priority wasn't
      set to anything).
      
      Fix this by outputting the appropriate sort of packet.
      
      Also, whilst we're at it, most of the marks are no longer used, so remove
      them and rename the remaining two to something more obvious.
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: David Howells <dhowells@redhat.com>
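The fix amounts to choosing the reply type from the mark and reading the abort code only when one was actually stored. A minimal sketch, in which all enum, struct and function names are hypothetical and do not reproduce the renamed kernel marks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rejection marks and reply packet kinds. */
enum reject_mark { REJECT_BUSY, REJECT_ABORT };
enum reply_kind { REPLY_BUSY, REPLY_ABORT };

struct reject_reply {
    enum reply_kind kind;
    uint32_t abort_code;    /* meaningful only for REPLY_ABORT */
};

/*
 * Fill in the reply for a rejected packet.  'priority' carries the abort
 * code, but only for the ABORT mark.  Returns false for an unknown mark.
 */
static bool build_reject(int mark, uint32_t priority,
                         struct reject_reply *reply)
{
    switch (mark) {
    case REJECT_BUSY:
        reply->kind = REPLY_BUSY;
        reply->abort_code = 0;          /* priority was never set: ignore it */
        return true;
    case REJECT_ABORT:
        reply->kind = REPLY_ABORT;
        reply->abort_code = priority;
        return true;
    default:
        return false;                   /* nothing sensible to send */
    }
}
```

Keeping the BUSY case separate means an unset priority value can no longer leak onto the wire as an abort code.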
    • rxrpc: Fix RTT gathering · b604dd98
      Committed by David Howells
      Fix RTT information gathering in AF_RXRPC by the following means:
      
       (1) Enable Rx timestamping on the transport socket with SO_TIMESTAMPNS.
      
       (2) If the sk_buff doesn't have a timestamp set when rxrpc_data_ready()
           collects it, set it at that point.
      
       (3) Allow ACKs to be requested on the last packet of a client call, but
           not a service call.  We need to be careful lest we undo:
      
      	bf7d620a
      	Author: David Howells <dhowells@redhat.com>
      	Date:   Thu Oct 6 08:11:51 2016 +0100
      	rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase
      
           but that only really applies to service calls that we're handling,
           since the client side gets to send the final ACK (or not).
      
       (4) When about to transmit an ACK or DATA packet, record the Tx timestamp
           only beforehand; don't update the timestamp afterwards.
      
       (5) Switch the ordering between recording the serial and recording the
           timestamp to always set the serial number first.  The serial number
           shouldn't be seen referenced by an ACK packet until we've transmitted
           the packet bearing it - so in the Rx path, we don't need the timestamp
           until we've checked the serial number.
      
      Fixes: cf1a6474 ("rxrpc: Add per-peer RTT tracker")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix checks as to whether we should set up a new call · dc71db34
      Committed by David Howells
      There's a check in rxrpc_data_ready() that's checking the CLIENT_INITIATED
      flag in the packet type field rather than in the packet flags field.
      
      Fix this by creating a pair of helper functions to check whether the packet
      is going to the client or to the server and use them generally.
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: David Howells <dhowells@redhat.com>
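A pair of direction helpers of the kind described might look like the sketch below. The header struct is a hypothetical two-field reduction, the helper names are illustrative, and the flag value 0x01 is an assumption rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLIENT_INITIATED 0x01   /* assumed bit value of the flag */

/* Hypothetical reduction of the wire header to the two relevant fields. */
struct hdr_sketch {
    uint8_t type;       /* packet type: not where the flag lives */
    uint8_t flags;      /* packet flags: the right field to test */
};

/* True if the packet was sent by a client, so is headed to the server. */
static bool pkt_to_server(const struct hdr_sketch *hdr)
{
    return (hdr->flags & CLIENT_INITIATED) != 0;
}

/* True if the packet was sent by a server, so is headed to the client. */
static bool pkt_to_client(const struct hdr_sketch *hdr)
{
    return !pkt_to_server(hdr);
}
```

Under that assumed bit value, a server reply whose type field happens to be odd and whose flags are clear would be misclassified by a test on the type field, but is handled correctly here because only the flags field is consulted.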
  6. 27 Sep, 2018 1 commit
  7. 12 Aug, 2018 1 commit
  8. 09 Aug, 2018 1 commit
    • rxrpc: Fix the keepalive generator [ver #2] · 330bdcfa
      Committed by David Howells
      AF_RXRPC has a keepalive message generator that generates a message for a
      peer ~20s after the last transmission to that peer to keep firewall ports
      open.  The implementation is incorrect in the following ways:
      
       (1) It mixes up ktime_t and time64_t types.
      
       (2) It uses ktime_get_real(), the output of which may jump forward or
           backward due to adjustments to the time of day.
      
       (3) If the current time jumps forward too much or jumps backwards, the
           generator function will crank the base of the time ring round one slot
           at a time (ie. a 1s period) until it catches up, spewing out VERSION
           packets as it goes.
      
      Fix the problem by:
      
       (1) Only using time64_t.  There's no need for sub-second resolution.
      
       (2) Using ktime_get_seconds() rather than ktime_get_real() so that time
           isn't perceived to go backwards.
      
       (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
           parts:
      
           (a) The "worker" function that manages the buckets and the timer.
      
           (b) The "dispatch" function that takes the pending peers and
               potentially transmits a keepalive packet before putting them back
               in the ring into the slot appropriate to the revised last-Tx time.
      
       (4) Taking everything that's pending out of the ring and splicing it into
           a temporary collector list for processing.
      
           In the case that there's been a significant jump forward, the ring
           gets entirely emptied and then the time base can be warped forward
           before the peers are processed.
      
           The warping can't happen if the ring isn't empty because the slot a
           peer is in is keepalive-time dependent, relative to the base time.
      
       (5) Limiting the number of iterations of the bucket array when scanning it.

       (6) Setting the timer to skip any empty slots as there's no point waking
           up if there's nothing to do yet.
      
      This can be triggered by an incoming call from a server after a reboot with
      AF_RXRPC and AFS built into the kernel causing a peer record to be set up
      before userspace is started.  The system clock is then adjusted by
      userspace, thereby potentially causing the keepalive generator to have a
      meltdown - which leads to a message like:
      
      	watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
      	...
      	Workqueue: krxrpcd rxrpc_peer_keepalive_worker
      	EIP: lock_acquire+0x69/0x80
      	...
      	Call Trace:
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? _raw_spin_lock_bh+0x29/0x60
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? __lock_acquire+0x3d3/0x870
      	 ? process_one_work+0x110/0x340
      	 ? process_one_work+0x166/0x340
      	 ? process_one_work+0x110/0x340
      	 ? worker_thread+0x39/0x3c0
      	 ? kthread+0xdb/0x110
      	 ? cancel_delayed_work+0x90/0x90
      	 ? kthread_stop+0x70/0x70
      	 ? ret_from_fork+0x19/0x24
      
      Fixes: ace45bec ("rxrpc: Fix firewall route keepalive")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
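Two of the ideas in the fix list, bounding the scan of the bucket array (5) and skipping empty slots when setting the timer (6), can be illustrated with a toy ring of one-second slots. This is a loose sketch under assumptions: `RING_SLOTS`, the occupancy array and both helpers are hypothetical and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 8    /* illustrative ring size, one slot per second */

/*
 * Number of slots that have come due between 'base' and 'now', capped at
 * the ring size so that a large forward jump costs one pass, not thousands.
 */
static unsigned int slots_due(int64_t base, int64_t now)
{
    int64_t elapsed = now - base;

    if (elapsed <= 0)
        return 0;       /* nothing due; a monotonic clock never goes back */
    return elapsed >= RING_SLOTS ? RING_SLOTS : (unsigned int)elapsed;
}

/*
 * Seconds until the next occupied slot, starting at 'cursor', or -1 if the
 * ring is empty and no timer is needed.  The scan is bounded by the ring size.
 */
static int next_wakeup_delay(const unsigned int occupancy[RING_SLOTS],
                             unsigned int cursor)
{
    unsigned int i;

    for (i = 0; i < RING_SLOTS; i++) {
        if (occupancy[(cursor + i) % RING_SLOTS])
            return (int)i;
    }
    return -1;
}
```

With the cap in place, a clock jump of several hours results in at most one sweep over the ring, which is the behaviour the soft-lockup report above shows was missing.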
  9. 04 Aug, 2018 2 commits
  10. 03 Aug, 2018 1 commit
  11. 02 Aug, 2018 1 commit
  12. 01 Aug, 2018 9 commits
    • rxrpc: Transmit more ACKs during data reception · d0b35a42
      Committed by David Howells
      Immediately flush any outstanding ACK on entry to rxrpc_recvmsg_data() -
      which transfers data to the target buffers - if we previously had an Rx
      underrun (ie. we returned -EAGAIN because we ran out of received data).
      This lets the server know that we've managed to receive something.
      
      Also flush any outstanding ACK after calling the function if it hit -EAGAIN
      to let the server know we processed some data.
      
      It might be better to send more ACKs, possibly on a time-based scheme, but
      that needs some more consideration.
      
      With this and some additional AFS patches, it is possible to get large
      unencrypted O_DIRECT reads to be almost as fast as NFS over TCP.  It looks
      like it might be theoretically possible to improve performance yet more for
      a server running a single operation as investigation of packet timestamps
      indicates that the server keeps stalling.
      
      The issue appears to be that rxrpc runs into trouble with ACK packets
      getting batched together (up to ~32 at a time) somewhere between the IP
      transmit queue on the client and the ethernet receive queue on the server.
      
      However, this case isn't too much of a worry as even a lightly loaded
      server should be receiving sufficient packet flux to flush the ACK packets
      to the UDP socket.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Propose, but don't immediately transmit, the final ACK for a call · a71a2651
      Committed by David Howells
      The final ACK that closes out an rxrpc call needs to be transmitted by the
      client unless we're going to follow up with a DATA packet for a new call on
      the same channel (which implicitly ACK's the previous call, thereby saving
      an ACK).
      
      Currently, we don't do that, so if no follow-on call is immediately
      forthcoming, the server will resend the last DATA packet - at which point
      rxrpc_conn_retransmit_call() will be triggered and will (re)send the final
      ACK.  But the server has to hold on to the last packet until the ACK is
      received, thereby holding up its resources.
      
      Fix the client side to propose a delayed final ACK, to be transmitted after
      a short delay, assuming the call isn't superseded by a new one.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Increase the size of a call's Rx window · 4075295a
      Committed by David Howells
      Increase the size of a call's Rx window from 32 to 63 - ie. one less than
      the size of the ring buffer.  This makes large data transfers perform
      better when the Tx window on the other side is around 64 (as is the case
      with Auristor's YFS fileserver).
      
      If the server window size is ~32 or smaller, this should make no
      difference.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Trace socket notification · 4272d303
      Committed by David Howells
      Trace notifications from the softirq side of the socket to the
      process-context side.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Trace packet transmission · 4764c0da
      Committed by David Howells
      Trace successful packet transmission (kernel_sendmsg() succeeded, that is)
      in AF_RXRPC.  We can share the enum that defines the transmission points
      with the trace_rxrpc_tx_fail() tracepoint, so rename its constants to be
      applicable to both.
      
      Also, save the internal call->debug_id in the rxrpc_channel struct so that
      it can be used in retransmission trace lines.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix the trace for terminal ACK (re)transmission · f3f8337c
      Committed by David Howells
      Fix the trace for terminal ACK (re)transmission to put in the right
      parameters.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Show some more information through /proc files · 6b97bd7a
      Committed by David Howells
      Show the four current call IDs in /proc/net/rxrpc/conns.
      
      Show the current packet Rx serial number in /proc/net/rxrpc/calls.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Display call expect-receive-by timeout in proc · 887763bb
      Committed by David Howells
      Display in /proc/net/rxrpc/calls the timeout by which a call next expects
      to receive a packet.
      
      This makes it easier to debug timeout issues.

      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: remove redundant variables 'sp' and 'did_discard' · f597a579
      Committed by YueHaibing
      Variables 'sp' and 'did_discard' are being assigned,
      but are never used, hence they are redundant and can be removed.
      
      fix following warning:
      
      net/rxrpc/call_event.c:165:25: warning: variable 'sp' set but not used [-Wunused-but-set-variable]
      net/rxrpc/conn_client.c:1054:7: warning: variable 'did_discard' set but not used [-Wunused-but-set-variable]
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      f597a579
  13. 31 Jul, 2018 1 commit
  14. 29 Jun, 2018 1 commit
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds committed
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  15. 21 Jun, 2018 1 commit
  16. 13 Jun, 2018 1 commit
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook committed
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
      
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
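
The motivation for the two-factor form can be illustrated with a minimal userspace sketch. The names `alloc_array()` and `array3_bytes()` below are hypothetical stand-ins, not the kernel functions, and `__builtin_mul_overflow()` is assumed to be available (GCC or Clang). The idea is that an open-coded `a * b` can silently wrap and yield a buffer that is too small, whereas the checked forms fail the allocation instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for kmalloc_array(): refuse the
 * request when n * size does not fit in size_t, rather than
 * allocating a buffer shorter than the caller expects.
 */
static void *alloc_array(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;	/* product wrapped around */
	return malloc(bytes);
}

/*
 * Hypothetical stand-in for array3_size(): saturate to SIZE_MAX on
 * overflow so that any allocator handed the result must fail.
 */
static size_t array3_bytes(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes) ||
	    __builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Small usage demonstration. */
static void demo(void)
{
	int *ok = alloc_array(4, sizeof(int));	/* 4 * sizeof(int) fits */

	assert(ok != NULL);
	free(ok);

	/* (SIZE_MAX / 2 + 1) * 2 wraps to 0 if multiplied naively. */
	assert(alloc_array(SIZE_MAX / 2 + 1, 2) == NULL);
}
```

With an open-coded multiplication, the second request in `demo()` would have asked for zero bytes; the checked helper reports failure instead.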
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6da2ec56
  17. 07 Jun, 2018 1 commit
  18. 05 Jun, 2018 1 commit
    • D
      rxrpc: Fix handling of call quietly cancelled out on server · 1a025028
      David Howells committed
      Sometimes an in-progress call will stop responding on the fileserver when
      the fileserver quietly cancels the call with an internally marked abort
      (RX_CALL_DEAD), without sending an ABORT to the client.
      
      This causes the client's call to eventually expire from lack of incoming
      packets directed its way, which currently leads to it being cancelled
      locally with ETIME.  Note that it's not currently clear as to why this
      happens as it's really hard to reproduce.
      
      The rotation policy implemented by kAFS, however, doesn't differentiate
      between ETIME meaning we didn't get any response from the server and ETIME
      meaning the call got cancelled mid-flow.  The latter leads to an oops when
      fetching data as the rotation partially resets the afs_read descriptor,
      which can result in a cleared page pointer being dereferenced because that
      page has already been filled.
      
      Handle this by the following means:
      
       (1) Set a flag on a call when we receive a packet for it.
      
       (2) Store the highest packet serial number so far received for a call
           (bearing in mind this may wrap).
      
       (3) If, when the "not received anything recently" timeout expires on a
           call, we've received at least one packet for a call and the connection
           as a whole has received packets more recently than that call, then
           cancel the call locally with ECONNRESET rather than ETIME.
      
           This indicates that the call was definitely in progress on the server.
      
       (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
           don't try the next server, but rather abort the call.
      
           This avoids the oops as we don't try to reuse the afs_read struct.
           Rather, as-yet ungotten pages will be reread at a later date.
      
      Also:
      
       (5) Add an rxrpc tracepoint to log detection of the call being reset.
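
The expiry decision in steps (1) to (3) can be sketched roughly as follows. This is a simplified illustration and not the code from the patch: the struct, its field names and both function names are hypothetical, and the wrap-tolerant comparison assumes 32-bit serial numbers and the usual two's-complement conversion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* True if serial a is later than serial b, tolerating 32-bit wrap. */
static bool serial_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Hypothetical, simplified view of the state consulted at expiry. */
struct call_rx_state {
	bool     rx_heard;		/* (1) a packet was received for the call */
	uint32_t call_rx_serial;	/* (2) highest serial seen for the call */
	uint32_t conn_hi_serial;	/* highest serial seen on the connection */
};

/* (3) Pick the error used to cancel the call when the timeout fires. */
static int expiry_error(const struct call_rx_state *s)
{
	if (s->rx_heard &&
	    serial_after(s->conn_hi_serial, s->call_rx_serial))
		return -ECONNRESET;	/* call was in progress, then went quiet */
	return -ETIME;			/* never heard anything, or nothing newer */
}

/* Small usage demonstration. */
static void demo(void)
{
	struct call_rx_state quiet = { true, 10, 20 };
	struct call_rx_state silent = { false, 0, 20 };

	assert(expiry_error(&quiet) == -ECONNRESET);
	assert(expiry_error(&silent) == -ETIME);
}
```

A caller following step (4) would then treat `-ECONNRESET` as a reason to abort the operation rather than rotate to the next server.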
      
      Without this, I occasionally see an oops like the following:
      
          general protection fault: 0000 [#1] SMP PTI
          ...
          RIP: 0010:_copy_to_iter+0x204/0x310
          RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
          RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
          RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
          RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
          R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
          R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
          FS:  00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
          Call Trace:
           skb_copy_datagram_iter+0x14e/0x289
           rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
           ? trace_buffer_unlock_commit_regs+0x4f/0x89
           rxrpc_kernel_recv_data+0x149/0x421
           afs_extract_data+0x1e0/0x798
           ? afs_wait_for_call_to_complete+0xc9/0x52e
           afs_deliver_fs_fetch_data+0x33a/0x5ab
           afs_deliver_to_call+0x1ee/0x5e0
           ? afs_wait_for_call_to_complete+0xc9/0x52e
           afs_wait_for_call_to_complete+0x12b/0x52e
           ? wake_up_q+0x54/0x54
           afs_make_call+0x287/0x462
           ? afs_fs_fetch_data+0x3e6/0x3ed
           ? rcu_read_lock_sched_held+0x5d/0x63
           afs_fs_fetch_data+0x3e6/0x3ed
           afs_fetch_data+0xbb/0x14a
           afs_readpages+0x317/0x40d
           __do_page_cache_readahead+0x203/0x2ba
           ? ondemand_readahead+0x3a7/0x3c1
           ondemand_readahead+0x3a7/0x3c1
           generic_file_buffered_read+0x18b/0x62f
           __vfs_read+0xdb/0xfe
           vfs_read+0xb2/0x137
           ksys_read+0x50/0x8c
           do_syscall_64+0x7d/0x1a0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note the weird value in RDI which is a result of trying to kmap() a NULL
      page pointer.
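
The RDI value can be reproduced with a short calculation. The sketch below assumes a 64-bit kernel without highmem, where kmap() reduces to a page-to-virtual-address conversion, and it assumes the x86-64 layout of that era without address randomisation: a vmemmap base of 0xffffea0000000000, a direct-map base of 0xffff880000000000, a 64-byte struct page and 4 KiB pages. None of these constants are stated in the message above; they are assumptions chosen to match the addresses visible in the register dump.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants, not taken from the commit message. */
#define ASSUMED_VMEMMAP_BASE		0xffffea0000000000ULL
#define ASSUMED_PAGE_OFFSET		0xffff880000000000ULL
#define ASSUMED_STRUCT_PAGE_SIZE	64
#define ASSUMED_PAGE_SHIFT		12

/*
 * Sketch of the page-to-virtual-address conversion: the page frame
 * number is the index of the struct page within the vmemmap array,
 * and the virtual address is that frame's offset in the direct map.
 */
static uint64_t page_to_virt_sketch(uint64_t page)
{
	int64_t pfn = (int64_t)(page - ASSUMED_VMEMMAP_BASE) /
		      ASSUMED_STRUCT_PAGE_SIZE;

	return ((uint64_t)pfn << ASSUMED_PAGE_SHIFT) + ASSUMED_PAGE_OFFSET;
}

/* Small usage demonstration. */
static void demo(void)
{
	/* A NULL page pointer yields the value seen in RDI in the oops. */
	assert(page_to_virt_sketch(0) == 0x0005080000000000ULL);
}
```

Under these assumptions the result is a non-canonical address, which is consistent with the oops being reported as a general protection fault rather than a page fault.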
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a025028