1. 10 Jan, 2017 40 commits
    • G
      bpf: allow adjusted map element values to spill · f0318d01
      Gianluca Borello committed
      commit 48461135 ("bpf: allow access into map value arrays")
      introduces the ability to do pointer math inside a map element value via
      the PTR_TO_MAP_VALUE_ADJ register type.
      
      The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
      is spilled into the stack, limiting several use cases, especially when
      generating bpf code from a compiler.
      
      Handle this case by explicitly enabling the register type
      PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
      max_value are reset just for BPF_LDX operations that don't result in a
      restore of a spilled register from stack.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • G
      bpf: allow helpers access to map element values · 5722569b
      Gianluca Borello committed
      Enable helpers to directly access a map element value by passing a
      register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
      arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
      
      This enables several use cases. For example, a typical tracing program
      might want to capture pathnames passed to sys_open() with:
      
      struct trace_data {
      	char pathname[PATHLEN];
      };
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	struct trace_data data;
      	bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
      
      	/* consume data.pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      }
      
      Such a program could easily hit the stack limit in case PATHLEN needs to
      be large or more local variables need to exist, both of which are quite
      common scenarios. Allowing direct helper access to map element values,
      one could do:
      
      struct bpf_map_def SEC("maps") scratch_map = {
      	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
      	.key_size = sizeof(u32),
      	.value_size = sizeof(struct trace_data),
      	.max_entries = 1,
      };
      
      SEC("kprobe/sys_open")
      int bpf_sys_open(struct pt_regs *ctx)
      {
      	int id = 0;
      	struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
      	if (!p)
      		return 0;
      	bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
      
      	/* consume p->pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      	return 0;
      }
      
      And wouldn't risk exhausting the stack.
      
      Code changes are loosely modeled after commit 6841de8b ("bpf: allow
      helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
      changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
      ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
      trivial, but since there is not currently a use case for that, it's
      reasonable to limit the set of changes.
      
      Also, add new tests to make sure accesses to map element values from
      helpers never go out of boundary, even when adjusted.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • G
      bpf: split check_mem_access logic for map values · dbcfe5f7
      Gianluca Borello committed
      Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
      check_mem_access() to a separate helper check_map_access_adj(). This
      makes it possible to use those checks in other parts of the verifier as well,
      where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
      example when checking helper function arguments. The same thing is
      already happening for other types such as PTR_TO_PACKET and its
      check_packet_access() helper.
      
      The code has been copied verbatim, with the only difference of removing
      the "off += reg->max_value" statement and moving the sum into the call
      statement to check_map_access(), as that was only needed due to the
      earlier common check_map_access() call.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
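The boundary check that this commit factors out can be modeled in user space. The following is a minimal sketch, not the kernel's code: `struct reg_model`, `MODEL_MAX_RANGE`, `model_check_map_access()` and `model_check_map_access_adj()` are hypothetical stand-ins. It illustrates the idea of validating an adjusted map value pointer at both extremes of its known range, with the sum of the fixed offset and the bound passed directly into the inner check.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical marker for "upper bound unknown" in this model. */
#define MODEL_MAX_RANGE (1024LL * 1024 * 1024)

/* Hypothetical, simplified register state: the variable part of the
 * offset into the map value is only known to lie in [min_value, max_value].
 */
struct reg_model {
	int64_t min_value;
	int64_t max_value;
	uint32_t value_size;	/* size of one map element value */
};

/* Plain check: [off, off + size) must lie inside the map value. */
int model_check_map_access(const struct reg_model *reg, int64_t off, int size)
{
	assert(size > 0);
	if (off < 0 || off + size > (int64_t)reg->value_size)
		return -EACCES;
	return 0;
}

/* Adjusted check: the access must be valid for the smallest and the
 * largest possible variable offset.
 */
int model_check_map_access_adj(const struct reg_model *reg, int off, int size)
{
	int err;

	if (reg->min_value < 0)
		return -EACCES;		/* could point before the value */
	err = model_check_map_access(reg, reg->min_value + off, size);
	if (err)
		return err;
	if (reg->max_value >= MODEL_MAX_RANGE)
		return -EACCES;		/* upper bound was never established */
	return model_check_map_access(reg, reg->max_value + off, size);
}
```

With a 16-byte value and a variable offset in [0, 8], an 8-byte access at fixed offset 0 passes, while the same access at fixed offset 4 is rejected because the upper extreme would end at byte 20.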
    • D
      Merge branch 'net-smc' · f3a3e248
      David S. Miller committed
      Ursula Braun says:
      
      ====================
      net/smc: Shared Memory Communications - RDMA
      
      Here is V4 of the SMC-R patches, which incorporates your feedback from the
      end of November. The most important change is the replacement of sysfs by a
      generic netlink solution in patch 04. I also tried to get rid of the __packed
      attributes; a few usages remain because of structures defined by the SMC-R
      protocol.
      
      V4 changes:
      The order of patches 03 and 04 for pnet table management and SMC IB-client
      establishing has been exchanged, since pnet table management is now built on
      top of smc_ib_devices.
      Patch 01: Use EXPORT_SYMBOL_GPL().
      Patch 02: Define "use_fallback" as bool.
                Get rid of useless smc_sock fields clearing in smc_sock_alloc(),
                since sk_alloc() clears out the memory.
      Patch 03: Postpone smc_ib_remember_port_attr() call till ib_device is
                mentioned in the pnet table.
      Patch 04: Replace sysfs-usage by a generic netlink approach for pnet table
                configuration.
                Change layout of pnet table entries to reference net_device and
                ib_device instead of dealing with names of net_devices and
                ib_devices.
      Patch 05: Adapt "use_fallback" usages to new type bool.
                Get rid of useless smc_sock fields clearing in smc_sock_alloc()
                Avoid __packed where possible.
                Check if clc responses are not too big.
      Patch 09: Postpone smc_setup_per_ibdev till the first connection with this
                ib_device is really created.
      Patch 11: Get rid of __packed usage.
      
      V3 changes:
      Patch 05: Remove unneeded DEFINE_WAIT
      Patch 06: Improve synchronization of link group creation
      Patch 07: Rename peer_rmbe_len into peer_rmbe_size to be more consistent
      Patch 09: Avoid calls of ib_get_memory_region with IB_ACCESS_LOCAL_WRITE,
                use new default local_dma_lkey from protection domain as lkey
                instead.
                Remove no longer needed function smc_ib_dereg_memory_region().
      Patch 14: Switch to state ACTIVE only if still in state INIT.
                Return 0 for recvmsg invoked in a socket closing state.
                Allow getname call in state APPCLOSEWAIT1
                Do not trigger destruction of a socket-in-error queued in accept
                queue.
                During cleanup of accept queue, make sure sockets are destructed,
                and sockets in fallback mode are handled appropriately.
                When freeing sndbufs/rmbs, remove them from their list and free
                the entry.
                Use add_wait_queue() and remove_wait_queue() in close wait
                functions.
                If actively closing a socket in state for PEERFINCLOSEWAIT, keep
                this state.
                If passively closing a socket while bytes are to be received, move
                to state APPCLOSEWAIT1.
                If actively aborting a socket, skip sending the close_abort flag,
                since RDMA communication is no longer possible.
                When terminating a link group, do not schedule link group freeing a
                2nd time, since already done when unregistering the last remaining
                connection.
      Patch 15: Introduce smc_diag module for monitoring SMC protocol sockets.
                This replaces the old patch 0015 dealing with procfs.
      
      V2 changes:
      Patch 0002: Add SMC versions for family key strings in net/core/sock.c.
      Patch 0006: initialize rb_tree.
      Patch 0007: Get rid of unneeded use of xchg() in smc_sndbuf_unuse() and
                  smc_rmb_unuse().
      Patch 0008: Correct error checking logic for ib_function calls.
                  Define struct smc_link field wr_tx_id as atomic_long_t.
                  Use "do_div" instead of "%" to be architecture-independent.
      Patch 0009: Correct error checking logic for ib_function calls.
      Patch 0011: Remove xchg() calls in cursor handling. Use atomic64_t for cursor
                  overlays on 64-bit architectures. If not available, use plain u64
                  and add locking for cursor reading and writing.
                  Implement smc_curs_add() without modulo operator "%".
      Patch 0012: Remove xchg() calls in cursor handling.
                  Implement smc_tx_rdma_writes() without modulo operator "%".
      Patch 0013: Remove xchg() calls in cursor handling.
      Patch 0014: Return type bool in smc_wr_tx_has_pending().
                  Remove unneeded semicolon in smc_close_shutdown_write().
                  Call smc_close_active() in non-fallback case only.
                  Get rid of duplicate schedule of sock_put_work().
                  Take nested sock_lock in smc_listen_work().
                  Start close stream_wait in case of prepared sends only.
      Patch 0015: Remove unneeded socket ref_count in smc_proc_seq_show().
                  Take lock before list_empty check in smc_proc_sock_list_del().
      
      These patches are the initial part of the implementation of the
      "Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
      RFC7609 [1]. While SMC-R does not aim to replace TCP,
      it taps a wealth of existing data center TCP socket applications
      to become more efficient without the need for rewriting them.
      SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
      For instance, when running 10 parallel connections with uperf, we measured
      a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
      (with throughput and latency comparable;
      measured on x86_64 with the same RoCE card and port).
      
      SMC-R does not require an RDMA communication manager (RDMA CM).
      
      SMC-R inherits TCP qualities such as reliable connections, host-based
      firewall packet filtering (on connection establishment) and unmodified
      application of communication encryption such as TLS (transport layer
      security) or SSL (secure sockets layer). Since original TCP is used to
      establish SMC-R connections, load balancers and packet inspection based
      on TCP/IP connection establishment continue to work for SMC-R.
      
      On the other hand, using SMC-R implies:
      - either involving a preload library when invoking the unchanged TCP-application
        or slightly modifying the source by simply changing the socket family in
        the socket() call
      - accepting extra overhead and latency in connection establishment due to
        SMC Connection Layer Control (CLC) handshake
      - explicit coupling of RoCE ports with Ethernet ports
      - not routable as currently built on RoCE V1
      - bypassing of packet-based networking features
          - filtering (netfilter)
          - sniffing (libpcap, packet sockets, (E)BPF)
          - traffic control (scheduling, shaping)
      - bypassing of IP-header based socket options
      - bypassing of memory buffer (pressure) management
      - unusable together with IPsec
      
      Overview of the SMC-R Protocol described in informational RFC 7609
      
      SMC-R is an open protocol that provides RDMA capabilities over RoCE
      transparently for applications exploiting TCP sockets.
      A new socket protocol family PF_SMC is introduced.
      There are no changes required to applications using the sockets API for TCP
      stream sockets other than the specification of the new socket family AF_SMC.
      Unmodified applications can be used by means of a dynamic preload shared
      library which rewrites the socket API call
      socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
      socket(AF_SMC,  SOCK_STREAM, IPPROTO_TCP).
      SMC-R re-uses the address family AF_INET for all addressing purposes around
      struct sockaddr.
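The rewrite performed by such a preload library can be sketched as a pure decision function. This is only an illustration under stated assumptions: `smc_rewrite_family()` is a hypothetical helper, not part of any shipped library, and a real shim would additionally interpose `socket()` itself and forward the call. `AF_SMC` is defined as 43, the value used by Linux, in case the system headers do not provide it. For simplicity the sketch matches `SOCK_STREAM` exactly and ignores type flags.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43	/* Linux value; older headers may lack the define */
#endif

/* Hypothetical helper: return the address family a preload shim would
 * pass to the real socket() call. Only TCP stream sockets over AF_INET
 * are redirected; everything else is left untouched.
 */
int smc_rewrite_family(int domain, int type, int protocol)
{
	if (domain == AF_INET && type == SOCK_STREAM &&
	    (protocol == 0 || protocol == IPPROTO_TCP))
		return AF_SMC;
	return domain;
}
```

A shim built around this helper would call the original `socket()` with the returned family and the unchanged type and protocol arguments.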
      
      SMC-R system architecture layers:
      
      +=============================================================================+
      |                                      | unmodified TCP application           |
      | native SMC application               +--------------------------------------+
      |                                      | dynamic preload shared library       |
      +=============================================================================+
      |                                 SMC socket                                  |
      +-----------------------------------------------------------------------------+
      |                    | TCP socket (for connection establishment and fallback) |
      | IB verbs           +--------------------------------------------------------+
      |                    | IP                                                     |
      +--------------------+--------------------------------------------------------+
      | RoCE device driver | some network device driver                             |
      +=============================================================================+
      
      Terms:
      
      A link group is determined by an ordered peer pair of TCP client and TCP server
      (IP addresses and subnet). Reversed client and server roles form a separate link group.
      A link is a logical point-to-point connection based on an
      infiniband reliable connected queue pair (RC-QP) between two RoCE ports
      (MACs and GIDs) of a peer pair.
      A link group can have 1..8 links for failover and load balancing.
      This initial Linux implementation always has 1 link per link group.
      Each link group on a peer can have 1..255 remote memory buffers (RMBs).
      If more RMBs are needed, a peer can open another link group
      (this initial Linux implementation) or fall back to TCP.
      Each RMB has its own particular size and its own (R)DMA mapping and credentials
      (rtoken consisting of rkey and RDMA "virtual address").
      This initial Linux implementation uses physically contiguous memory for RMBs
      but we are working towards scattered memory because of memory fragmentation.
      Each RMB has 1..255 RMB elements (RMBEs) of equal size
      to provide multiplexing of connections within an RMB.
      An RMBE is the RDMA Write destination organized as wrapping ring buffer
      for data transmit of a particular connection in one direction
      (duplex by means of mirror symmetry as with TCP).
      This initial Linux implementation always has 1 RMBE per RMB
      and thus an individual RMB for each connection.
      
      SMC-R connection establishment with subsequent data transfer:
      
         CLIENT                                                   SERVER
      
      TCP three-way handshake:
                               regular TCP SYN
            -------------------------------------------------------->
                             regular TCP SYN ACK
            <--------------------------------------------------------
                               regular TCP ACK
            -------------------------------------------------------->
      
      SMC Connection Layer Control (CLC) handshake
      exchanges RDMA credentials between peers:
                   via above TCP connection: SMC CLC Proposal
            -------------------------------------------------------->
                    via above TCP connection: SMC CLC Accept
            <--------------------------------------------------------
                   via above TCP connection: SMC CLC Confirm
            -------------------------------------------------------->
      
      SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
                       RoCE RC-QP: SMC LLC Confirm Link
            <========================================================
                   RoCE RC-QP: SMC LLC Confirm Link response
            ========================================================>
      
      SMC data transmission (incl. SMC Connection Data Control (CDC) message):
                             RoCE RC-QP: RDMA Write
            ========================================================>
                   RoCE RC-QP: SMC CDC message (flow control)
            ========================================================>
                                ...
      
                             RoCE RC-QP: RDMA Write
            <========================================================
                   RoCE RC-QP: SMC CDC message (flow control)
            <========================================================
                                ...
      
      Data flow within an established connection:
      
      +----------------------------------------------------------------------------
      |            SENDER
      | sendmsg()
      |    |
      |    | produces into sndbuf [sender's process context]
      |    v
      | +--------+
      | | sndbuf | [ring buffer]
      | +--------+
      |    |
      |    | consumes from sndbuf and produces into receiver's RMBE [any context]
      |    | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
      |    |
      +----|-----------------------------------------------------------------------
           |
      +----|-----------------------------------------------------------------------
      |    v       RECEIVER
      | +------+
      | | RMBE | [ring buffer, can have size different from sender's sndbuf]
      | |      | [RMBE represents rcvbuf, no further de-coupling as on sender side]
      | +------+
      |    |
      |    | consumes from RMBE [receiver's process context]
      |    v
      | recvmsg()
      +----------------------------------------------------------------------------
      
      Flow control ("cursor" updates) by means of SMC CDC messages:
      
                     SENDER                            RECEIVER
      
              sends updates via CDC-------------+   sends updates via CDC
              on consuming from sndbuf          |   on consuming from RMBE
              and producing into RMBE           |   by means of recvmsg()
                                                |            |
                                                |            |
            +-----------------------------------|------------+
            |                                   |
         +--v-------------------------+      +--v-----------------------+
         | receiver's consumer cursor |      | sender's producer cursor----+
         +----------------|-----------+      +--------------------------+  |
                          |                                                |
                          |                        receiver's RMBE         |
                          |                  +--------------------------+  |
                          |                  |                          |  |
                          +--------------------------------+            |  |
                                             |             |            |  |
                                             |             v            |  |
                                             |             +------------|  |
                                             |-------------+////////////|  |
                                             |//RDMA data written by////|  |
                                             |////sender that is////////|  |
                                             |/available to be consumed/|  |
                                             |///////// +---------------|  |
                                             |----------+^              |  |
                                             |           |              |  |
                                             |           +-----------------+
                                             |                          |
                                             +--------------------------+
      
      Sending updates of the producer cursor is immediate for low latency;
      something like Nagle's algorithm (absence of TCP_NODELAY) is optional and
      currently not part of this initial Linux implementation.
      Sending updates of the consumer cursor is conditional to avoid the
      silly window syndrome.
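The cursor handling behind this flow control can be illustrated with a small ring buffer model. This is a sketch under assumptions, not the code from the patches: `struct ring_cursor`, `ring_cursor_add()` and `ring_bytes_to_consume()` are hypothetical names. It shows a cursor consisting of a wrap counter and an offset, advanced without the modulo operator, and how the amount of consumable data follows from the producer and consumer cursors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cursor model for a wrapping ring buffer. */
struct ring_cursor {
	uint16_t wrap;	/* how often the cursor has wrapped around */
	uint32_t count;	/* byte offset inside the ring, 0 <= count < size */
};

/* Advance the cursor by value bytes (value <= size) without using "%". */
void ring_cursor_add(uint32_t size, struct ring_cursor *curs, uint32_t value)
{
	assert(value <= size && curs->count < size);
	curs->count += value;
	if (curs->count >= size) {
		curs->wrap++;
		curs->count -= size;
	}
}

/* Bytes written by the producer that the consumer has not read yet.
 * Assumes the producer is at most one lap ahead of the consumer.
 */
uint32_t ring_bytes_to_consume(uint32_t size,
			       const struct ring_cursor *cons,
			       const struct ring_cursor *prod)
{
	if (prod->wrap == cons->wrap) {
		assert(prod->count >= cons->count);
		return prod->count - cons->count;
	}
	return size - cons->count + prod->count;
}
```

In an 8-byte ring, producing 6 bytes, consuming 4, and producing 5 more leaves the producer at offset 3 after one wrap, with 7 bytes available to the consumer.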
      
      Normal connection termination:
      
      Normal connection termination starts transitioning from socket state
      ACTIVE via either "Active Close" or "Passive Close".
      
      shutdown rdwr               +-----------------+
      or close,   +-------------->|  INIT / CLOSED  |<-------------+
      send PeerCon|nClosed        +-----------------+              | PeerConnClosed
                  |                       |                        | received
                  |            connection | established            |
                  |                       V                        |
          +----------------+     +-----------------+     +----------------+
          |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
          +----------------+     +-----------------+     +----------------+
                  |                   |         |                   |
                  |     Active Close: |         |Passive Close:     |
                  |     close or      |         |PeerConnClosed or  |
                  |     shutdown wr or|         |PeerDoneWriting    |
                  |     shutdown rdwr |         |received           |
                  |                   V         V                   |
       PeerConnClo|sed    +--------------+   +-------------+        | close or
       received   +--<----|PeerCloseWait1|   |AppCloseWait1|--->----+ shutdown rdwr,
                  |       +--------------+   +-------------+        | send
                  |  PeerDoneWri|ting                | shutdown wr, | PeerConnClosed
                  |  received   |            send Pee|rDoneWriting  |
                  |             V                    V              |
                  |       +--------------+   +-------------+        |
                  +--<----|PeerCloseWait2|   |AppCloseWait2|--->----+
                          +--------------+   +-------------+
      
      In state CLOSED, the socket can be destructed only after the application has
      issued a close().
      
      Abnormal connection termination:
      
                                  +-----------------+
                  +-------------->|  INIT / CLOSED  |<-------------+
                  |               +-----------------+              |
                  |                                                |
                  |           +-----------------------+            |
                  |           |     Any state         |            |
       PeerConnAbo|rt         | (before setting       |            | send
       received   |           |  PeerConnClosed       |            | PeerConnAbort
                  |           |  indicator in         |            |
                  |           |  peer's RMBE)         |            |
                  |           +-----------------------+            |
                  |                   |         |                  |
                  |     Active Abort: |         | Passive Abort:   |
                  |     problem,      |         | PeerConnAbort    |
                  |     send          |         | received,        |
                  |     PeerConnAbort,|         | ECONNRESET       |
                  |     ECONNABORTED  |         |                  |
                  |                   V         V                  |
                  |       +--------------+   +--------------+      |
                  +-------|PeerAbortWait |   | ProcessAbort |------+
                          +--------------+   +--------------+
      
      Implementation notes beyond RFC 7609:
      
      A PNET table (configured via generic netlink as of V4, previously via sysfs)
      provides the mapping between network device names and RoCE Infiniband device
      names for the transparent switch of data communication.
      A PNET table can contain an arbitrary number of PNETIDs.
      Each PNETID contains exactly one (Ethernet) network device name
      and one or more RoCE Infiniband device names.
      Each device name can only exist in at most one PNETID (no overlapping).
      This initial Linux implementation allows at most one RoCE Infiniband device
      name per PNETID.
      After a new TCP connection is established, the network device name
      used for egress traffic with the TCP connection's local source IP address
      is used as key to lookup the unique PNETID, and the RoCE Infiniband device
      of this PNETID is used to switch data communication from TCP to RDMA
      during SMC CLC handshake.
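The lookup described above can be sketched as a search over a small table. This is an illustration only: `struct pnet_entry` and `pnet_find_by_ndev()` are hypothetical, and device names such as "eth0" or "mlx4_0" in any usage are made-up examples. The sketch assumes, as in this initial implementation, one network device and one RoCE Infiniband device per PNETID.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pnet table entry: one Ethernet device and one RoCE
 * Infiniband device per PNETID.
 */
struct pnet_entry {
	const char *pnetid;
	const char *ndev_name;
	const char *ibdev_name;
};

/* Find the entry whose network device carries the TCP connection's
 * egress traffic. Returns NULL when the device is in no PNETID, in
 * which case the connection would stay on TCP.
 */
const struct pnet_entry *pnet_find_by_ndev(const struct pnet_entry *tbl,
					   size_t n, const char *ndev_name)
{
	size_t i;

	assert(tbl != NULL && ndev_name != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(tbl[i].ndev_name, ndev_name) == 0)
			return &tbl[i];
	}
	return NULL;
}
```

Because each device name may appear in at most one PNETID, the first match is also the only match.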
      
      Problem determination:
      
      A protocol dissector is available with upstream wireshark for formatting
      SMC-R related RoCE LAN traffic.
      [https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]
      
      We are working on enhancing the Linux implementation to cover:
      
      - Improve default socket closing asynchronicity
      - Address corner cases with many parallel connections
      - Tracing
      - Integrated load balancing and fail-over within a link group
      - Splice and sendpage support
      - IPv6 addressing support
      - Keepalive, Cork
      - Namespaces support
      - Urgent data
      - More socket options
      - Diagnostics
      - Statistics support
      - SNMP support
      
      References:
      
      [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: netlink interface for SMC sockets · f16a7dd5
      Ursula Braun committed
      Support for SMC socket monitoring via netlink sockets of protocol
      NETLINK_SOCK_DIAG.
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: socket closing and linkgroup cleanup · b38d7324
      Ursula Braun committed
      * smc_shutdown() and smc_release() handling
      * delayed linkgroup cleanup for linkgroups without connections
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: receive data from RMBE · 952310cc
      Ursula Braun committed
      move RMBE data into user space buffer and update managing cursors
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: send data (through RDMA) · e6727f39
      Ursula Braun committed
      copy data to kernel send buffer, and trigger RDMA write
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: connection data control (CDC) · 5f08318f
      Ursula Braun committed
      send and receive CDC messages (via IB message send and CQE)
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: link layer control (LLC) · 9bf9abea
      Ursula Braun committed
      send and receive LLC messages CONFIRM_LINK (via IB message send and CQE)
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR · bd4ad577
      Ursula Braun committed
      Prepare the link for RDMA transport:
      Create a queue pair (QP) and move it into the state Ready-To-Receive (RTR).
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: work request (WR) base for use by LLC and CDC · f38ba179
      Ursula Braun committed
      The base containers for RDMA transport are work requests and completion
      queue entries processed through Infiniband verbs:
      * allocate and initialize these areas
      * map these areas to DMA
      * implement the basic communication consisting of work request posting
        and reception of completion queue events
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: remote memory buffers (RMBs) · cd6851f3
      Ursula Braun committed
      * allocate data RMB memory for sending and receiving
      * size depends on the maximum socket send and receive buffers
      * allocated RMBs are kept during the lifetime of the owning link group
      * map the allocated RMBs to DMA
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: connection and link group creation · 0cfdd8f9
      Ursula Braun committed
      * create smc_connection for SMC-sockets
      * determine suitable link group for a connection
      * create a new link group if necessary
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • U
      smc: CLC handshake (incl. preparation steps) · a046d57d
      Ursula Braun committed
      * CLC (Connection Layer Control) handshake
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      smc: establish pnet table management · 6812baab
      Thomas Richter committed
      Connection creation with SMC-R starts through an internal
      TCP-connection. The Ethernet interface for this TCP-connection is not
      restricted to the Ethernet interface of a RoCE device. Any existing
      Ethernet interface belonging to the same physical net can be used, as
      long as there is a defined relation between the Ethernet interface and
      some RoCE devices. This relation is defined with the help of an
      identification string called "Physical Net ID" or short "pnet ID".
      Information about defined pnet IDs and their related Ethernet
      interfaces and RoCE devices is stored in the SMC-R pnet table.
      
      A pnet table entry consists of the identifying pnet ID and the
      associated network and IB device.
      This patch adds pnet table configuration support using the
      generic netlink message interface referring to network and IB device
      by their names. Commands exist to add, delete, and display pnet table
      entries, and to flush or display the entire pnet table.
      
      There are cross-checks to verify whether the ethernet interfaces
      or infiniband devices really exist in the system. If either device
      is not available, the pnet ID entry is not created.
      Loss of network devices and IB devices is also monitored;
      a pnet ID entry is removed when an associated network or
      IB device is removed.
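      
      The table behaviour described above can be sketched in user-space C.
      This is a minimal illustration, not the patch's implementation: the
      names pnet_entry, pnet_add, pnet_lookup and pnet_remove_ndev, the table
      capacity and the name length are all assumptions, and device existence
      is passed in by the caller instead of being looked up in the system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PNET_MAX_ENTRIES 8	/* arbitrary capacity for this sketch */
#define PNET_NAME_LEN 17	/* hypothetical: 16 characters plus NUL */

/* One entry: the pnet ID plus the associated network and IB device names. */
struct pnet_entry {
	bool used;
	char pnet_id[PNET_NAME_LEN];
	char ndev_name[PNET_NAME_LEN];
	char ibdev_name[PNET_NAME_LEN];
};

static struct pnet_entry pnet_table[PNET_MAX_ENTRIES];

/* Add an entry; refuse if either device does not exist or the table is full. */
static int pnet_add(const char *id, const char *ndev, const char *ibdev,
		    bool ndev_exists, bool ibdev_exists)
{
	assert(id && ndev && ibdev);
	if (!ndev_exists || !ibdev_exists)
		return -1;	/* cross-check failed: no entry is created */
	for (size_t i = 0; i < PNET_MAX_ENTRIES; i++) {
		struct pnet_entry *e = &pnet_table[i];

		if (e->used)
			continue;
		snprintf(e->pnet_id, sizeof(e->pnet_id), "%s", id);
		snprintf(e->ndev_name, sizeof(e->ndev_name), "%s", ndev);
		snprintf(e->ibdev_name, sizeof(e->ibdev_name), "%s", ibdev);
		e->used = true;
		return 0;
	}
	return -1;	/* table full */
}

/* Return the pnet ID associated with a network device name, or NULL. */
static const char *pnet_lookup(const char *ndev)
{
	for (size_t i = 0; i < PNET_MAX_ENTRIES; i++)
		if (pnet_table[i].used &&
		    strcmp(pnet_table[i].ndev_name, ndev) == 0)
			return pnet_table[i].pnet_id;
	return NULL;
}

/* Drop every entry that refers to a network device that went away. */
static int pnet_remove_ndev(const char *ndev)
{
	int removed = 0;

	for (size_t i = 0; i < PNET_MAX_ENTRIES; i++) {
		if (pnet_table[i].used &&
		    strcmp(pnet_table[i].ndev_name, ndev) == 0) {
			pnet_table[i].used = false;
			removed++;
		}
	}
	return removed;
}
```

      In this sketch a caller would invoke pnet_remove_ndev() from whatever
      mechanism reports device removal; the real patch uses generic netlink
      for configuration, which is not modelled here.
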
      Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6812baab
    • U
      smc: introduce SMC as an IB-client · a4cf0443
      Ursula Braun committed
      * create a list of SMC IB-devices
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4cf0443
    • U
      smc: establish new socket family · ac713874
      Ursula Braun committed
      * enable smc module loading and unloading
      * register new socket family
      * basic smc socket creation and deletion
      * use backing TCP socket to run CLC (Connection Layer Control)
        handshake of SMC protocol
      * Setup for InfiniBand traffic is implemented in follow-on patches.
        For now fallback to TCP socket is always used.
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac713874
    • U
      net: introduce keepalive function in struct proto · 4b9d07a4
      Ursula Braun committed
      Calling tcp_set_keepalive() directly from the protocol-agnostic
      sock_setsockopt() function in net/core/sock.c violates network
      layering, and the newly introduced SMC-R protocol will need its own
      keepalive function. Therefore, add a "keepalive" function pointer
      to "struct proto", and call it from sock_setsockopt() via this pointer.
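      
      The dispatch pattern described here can be sketched as follows. This
      is a simplified user-space model, not the kernel code: the struct
      layouts are cut down to the one relevant member, and
      generic_set_keepalive, tcp_like_keepalive, the two example protocols
      and the keepalive_on field are hypothetical names chosen for
      illustration.

```c
#include <assert.h>
#include <stddef.h>

struct sock;

/* Simplified stand-in for struct proto: only the new hook is modelled. */
struct proto {
	const char *name;
	void (*keepalive)(struct sock *sk, int valbool);	/* may be NULL */
};

struct sock {
	const struct proto *prot;
	int keepalive_on;	/* hypothetical field: records the last setting */
};

/* A protocol-specific keepalive handler, standing in for TCP's. */
static void tcp_like_keepalive(struct sock *sk, int valbool)
{
	sk->keepalive_on = valbool;
}

static const struct proto tcp_like_proto = { "tcp-like", tcp_like_keepalive };
static const struct proto raw_like_proto = { "raw-like", NULL };

/*
 * Protocol-agnostic setter: it never names a specific protocol, it only
 * follows the function pointer if the protocol provides one.
 * Returns 1 if a protocol hook was invoked, 0 otherwise.
 */
static int generic_set_keepalive(struct sock *sk, int valbool)
{
	assert(sk && sk->prot);
	if (sk->prot->keepalive) {
		sk->prot->keepalive(sk, valbool);
		return 1;
	}
	return 0;
}
```

      A new protocol then only has to fill in its own keepalive member; the
      generic code stays unchanged.
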
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b9d07a4
    • D
      Merge branch 'sh_eth-wol' · c8584b3f
      David S. Miller committed
      Niklas Söderlund says:
      
      ====================
      sh_eth: add wake-on-lan support via magic packet
      
      This series adds support for Wake-on-LAN using Magic Packet for a few
      models of the sh_eth driver. Patch 1/6 fixes a naming error, patch 2/6
      adds generic support to control and support WoL, while patches 3/6 - 6/6
      enable different models.
      
      Based on top of net-next master.
      
      Changes since v2:
      - Fix bookkeeping for "active_count" and "event_count" reported in
        /sys/kernel/debug/wakeup_sources. Thanks Geert for noticing this.
      - Add new patch 1/6 which corrects the name of ECMR_MPDE bit, suggested
        by Sergei.
      - s/sh7743/sh7734/ in patch 5/6. Thanks Geert for spotting this.
      - Spelling improvements suggested by Sergei and Geert.
      - Add Tested-by to 3/6 and 4/6.
      
      Changes since v1:
      - Split generic WoL functionality and device enablement into different
        patches.
      - Enable more devices than Gen2 after feedback from Geert and
        datasheets.
      - Do not set mdp->irq_enabled = false and remove specific MagicPacket
        interrupt clearing; instead let sh_eth_error() clear the interrupt as
        for other EMAC interrupts. Thanks Sergei for the suggestion.
      - Use the original return logic in sh_eth_resume().
      - Move sh_eth_private variable *clk to top of data structure to avoid
        possible gaps due to alignment restrictions.
      - Make wol_enabled in sh_eth_private part of the already existing
        bitfield instead of a bool.
      - Do not initialize mdp->wol_enabled to 0; the struct is kzalloc'ed so
        it's already set to 0.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8584b3f
    • N
      sh_eth: enable wake-on-lan for sh7763 · 267e1d5c
      Niklas Söderlund committed
      This is based on the public datasheet for sh7763, which shows it has the
      same behavior and registers for WoL as other versions of sh_eth.
      Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      267e1d5c
    • N
      sh_eth: enable wake-on-lan for sh7734 · 159c2a90
      Niklas Söderlund committed
      This is based on the public datasheet for sh7734, which shows it has the
      same behavior and registers for WoL as other versions of sh_eth.
      Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      159c2a90
    • N
      sh_eth: enable wake-on-lan for r8a7740/armadillo · 33017e24
      Niklas Söderlund committed
      Geert Uytterhoeven reported that WoL worked on his Armadillo board.
      Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33017e24
    • N
      e410d86d
    • N
      sh_eth: add generic wake-on-lan support via magic packet · d8981d02
      Niklas Söderlund committed
      Add generic functionality to support Wake-on-LAN using MagicPacket, which
      is supported by at least a few versions of sh_eth. Only add
      functionality for WoL; no specific sh_eth versions are marked to support
      WoL yet.
      
      WoL is enabled in the suspend callback by setting MagicPacket detection
      and disabling all interrupts except MagicPacket. In the resume path the
      driver needs to reset the hardware to rearm the WoL logic. This prevents
      the driver from simply restoring the registers and taking advantage of
      the fact that sh_eth was not suspended to reduce resume time. To reset the
      hardware, the driver closes and reopens the device just as it would
      in a normal suspend/resume scenario without WoL enabled, but it both
      closes and opens the device in the resume callback since the device
      needs to be open for WoL to work.
      
      One quirk needed for WoL is that the module clock needs to be prevented
      from being switched off by Runtime PM. To keep the clock alive, the
      suspend callback needs to call clk_enable() directly to increase the
      usage count of the clock. Then, when Runtime PM decreases the clock usage
      count, it won't reach 0 and the clock won't be switched off.
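      
      The usage-count reasoning can be illustrated with a toy model. This is
      only a sketch of the counting argument, not the clk framework or the
      driver: struct clk_model and all function names are hypothetical, and
      dropping the extra reference in the resume path is an assumption made
      here to keep the count balanced.

```c
#include <assert.h>

/* Toy clock: it runs while at least one user holds an enable reference. */
struct clk_model {
	int enable_count;
	int running;
};

static void clk_model_enable(struct clk_model *c)
{
	if (c->enable_count++ == 0)
		c->running = 1;
}

static void clk_model_disable(struct clk_model *c)
{
	assert(c->enable_count > 0);
	if (--c->enable_count == 0)
		c->running = 0;
}

/* Suspend path: with WoL enabled, take an extra reference on the clock. */
static void wol_suspend(struct clk_model *c, int wol_enabled)
{
	if (wol_enabled)
		clk_model_enable(c);
}

/* Resume path (assumption): drop the extra reference taken at suspend. */
static void wol_resume(struct clk_model *c, int wol_enabled)
{
	if (wol_enabled)
		clk_model_disable(c);
}
```

      With one reference held by the power-management side and one by the
      suspend callback, the first release leaves the count at 1, so the
      modelled clock keeps running while the system waits for a MagicPacket.
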
      Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d8981d02
    • N
      sh_eth: use correct name for ECMR_MPDE bit · 6dcf45e5
      Niklas Söderlund committed
      This bit was wrongly named due to a typo. Sergei checked the SH7734/63
      manuals and this bit should be named MPDE.
      Suggested-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dcf45e5
    • D
      Merge branch 'icmp-reply-optimize' · 9f2f27a9
      David S. Miller committed
      Jesper Dangaard Brouer says:
      
      ====================
      net: optimize ICMP-reply code path
      
      This patchset optimizes the ICMP-reply code path for ICMP packets
      that get rate limited. A remote party can easily trigger this code
      path by sending packets to a port number with no listening service.
      
      Generally, the patchset moves the sysctl_icmp_msgs_per_sec ratelimit
      check earlier in the code path and removes an allocation.
      
      Use-case: The specific case where I experienced this being a bottleneck
      is sending UDP packets to a port with no listener, which results
      in the kernel replying with ICMP Destination Unreachable (type:3), Port
      Unreachable (code:3), which causes the bottleneck.
      
      After Eric and Paolo optimized the UDP socket code, the kernel's PPS
      processing capability is lower for no-listen ports than for normal UDP
      sockets. This is bad for capacity planning when restarting a service.
      
      UDP no-listen benchmark 8xCPUs using pktgen_sample04_many_flows.sh:
       Baseline: 6.6 Mpps
       Patch:   14.7 Mpps
      Driver mlx5 at 50Gbit/s.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f2f27a9
    • J
      net: for rate-limited ICMP replies save one atomic operation · 7ba91ecb
      Jesper Dangaard Brouer committed
      It is possible to avoid the atomic operation in icmp{v6,}_xmit_lock,
      by checking the sysctl_icmp_msgs_per_sec ratelimit before these calls,
      as pointed out by Eric Dumazet, but the BH disabled state must be correct.
      
      The icmp_global_allow() call states it must be called with BH
      disabled.  This protection was given by the calls icmp_xmit_lock and
      icmpv6_xmit_lock.  Thus, split out local_bh_disable/enable from these
      functions and maintain it explicitly at callers.
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ba91ecb
    • J
      net: reduce cycles spend on ICMP replies that gets rate limited · c0303efe
      Jesper Dangaard Brouer committed
      This patch splits the global and per-(inet)peer ICMP-reply limiter
      code, and moves the global limit check earlier in the packet
      processing path, thus avoiding spending cycles on ICMP replies that
      get limited/suppressed anyhow.
      
      The global ICMP rate limiter icmp_global_allow() is a good solution;
      it just happens too late in the process. The kernel goes through the
      full route lookup (return path) for the ICMP message before making
      the rate limit decision not to send the ICMP reply.
      
      Details: The kernel's global rate limiter for ICMP messages was added
      in commit 4cdf507d ("icmp: add a global rate limitation"). It is
      a token bucket limiter with a global lock. It brilliantly avoids
      lock contention by only updating when 20ms (HZ/50) have elapsed. It
      can then avoid taking the lock when credit is exhausted (when under
      pressure) and the time constraint for refill is not yet met.
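      
      The limiter behaviour described in the details paragraph can be
      sketched as a small user-space token bucket. This is an illustration
      under stated assumptions, not the kernel's icmp_global_allow(): time is
      passed in as milliseconds, the lock is replaced by a counter so the
      lock-free fast path is observable, and struct icmp_bucket, bucket_allow
      and the field names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REFILL_INTERVAL_MS 20	/* mirrors the 20 ms (HZ/50) in the text */

struct icmp_bucket {
	uint32_t credit;	/* tokens currently available */
	uint32_t burst;		/* upper bound on credit */
	uint32_t msgs_per_sec;	/* refill rate */
	uint64_t stamp_ms;	/* time of the last refill */
	unsigned int lock_taken;/* counts simulated lock acquisitions */
};

/* Return true if one message may be sent at time now_ms. */
static bool bucket_allow(struct icmp_bucket *b, uint64_t now_ms)
{
	uint64_t delta;

	assert(now_ms >= b->stamp_ms);
	delta = now_ms - b->stamp_ms;

	/* Fast path: no credit and too early to refill, so skip the lock. */
	if (b->credit == 0 && delta < REFILL_INTERVAL_MS)
		return false;

	b->lock_taken++;	/* stands in for taking the global lock */

	/* Refill only when at least one interval has elapsed. */
	if (delta >= REFILL_INTERVAL_MS) {
		uint64_t incr = (uint64_t)b->msgs_per_sec * delta / 1000;

		if (incr) {
			uint64_t c = b->credit + incr;

			b->credit = c > b->burst ? b->burst : (uint32_t)c;
			b->stamp_ms = now_ms;
		}
	}

	if (b->credit == 0)
		return false;
	b->credit--;
	return true;
}
```

      Under load the bucket is usually empty, so most calls return from the
      fast path without touching the lock; that is the property the commit
      message highlights.
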
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0303efe
    • J
      Revert "icmp: avoid allocating large struct on stack" · 8d9ba388
      Jesper Dangaard Brouer committed
      This reverts commit 9a99d4a5 ("icmp: avoid allocating large struct
      on stack"), because struct icmp_bxm is not really a large struct, and
      allocating and freeing this small 112-byte struct hurts performance.
      
      Fixes: 9a99d4a5 ("icmp: avoid allocating large struct on stack")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d9ba388
    • D
      Merge tag 'rxrpc-rewrite-20170109' of... · aaa9c107
      David S. Miller committed
      Merge tag 'rxrpc-rewrite-20170109' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
      
      David Howells says:
      
      ====================
      afs: Refcount afs_call struct
      
      These patches provide some tracepoints for AFS and fix a potential leak by
      adding refcounting to the afs_call struct.
      
      The patches are:
      
       (1) Add some tracepoints for logging incoming calls and monitoring
           notifications from AF_RXRPC and data reception.
      
       (2) Get rid of afs_wait_mode as it didn't turn out to be as useful as
           initially expected.  It can be brought back later if needed.  This
           clears some stuff out that I don't then need to fix up in (4).
      
       (3) Allow listen(..., 0) to be used to disable listening.  This makes
           shutting down the AFS cache manager server in the kernel much easier
           and the accounting simpler as we can then be sure that (a) all
           preallocated afs_call structs are released and (b) no new incoming
           calls are going to be started.
      
           For the moment, listening cannot be reenabled.
      
       (4) Add refcounting to the afs_call struct to fix a potential multiple
           release detected by static checking and add a tracepoint to follow the
           lifecycle of afs_call objects.
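      
      Item (4) relies on reference counting to rule out a multiple release.
      A minimal user-space sketch of the idea follows; it is not the AFS
      code. struct call_obj, call_alloc, call_get and call_put are
      hypothetical names, the counter is a plain int where kernel code would
      use an atomic type, and freed_flag exists only so the final release is
      observable.

```c
#include <assert.h>
#include <stdlib.h>

struct call_obj {
	int usage;		/* number of outstanding references */
	int *freed_flag;	/* incremented once when the object is freed */
};

/* Allocate an object holding one reference for the caller. */
static struct call_obj *call_alloc(int *freed_flag)
{
	struct call_obj *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->usage = 1;
	c->freed_flag = freed_flag;
	return c;
}

/* Take an additional reference, e.g. before handing the object on. */
static void call_get(struct call_obj *c)
{
	assert(c->usage > 0);
	c->usage++;
}

/* Drop a reference; only the last put actually frees the object. */
static void call_put(struct call_obj *c)
{
	assert(c->usage > 0);
	if (--c->usage == 0) {
		if (c->freed_flag)
			(*c->freed_flag)++;
		free(c);
	}
}
```

      Each code path that holds the object pairs one get with one put, so
      the object is freed exactly once regardless of which path finishes
      last.
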
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aaa9c107
    • D
      Merge branch 'dsa_swqitch_ops-const' · 73517885
      David S. Miller committed
      Florian Fainelli says:
      
      ====================
      net: dsa: Make dsa_switch_ops const
      
      This patch series allows us to annotate dsa_switch_ops with a const
      qualifier.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73517885
    • F
      net: dsa: Make dsa_switch_ops const · a82f67af
      Florian Fainelli committed
      Now that we have properly encapsulated and made drivers utilize exported
      functions, we can switch dsa_switch_ops to be annotated with const.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a82f67af
    • F
      net: dsa: Encapsulate legacy switch drivers into dsa_switch_driver · ab3d408d
      Florian Fainelli committed
      In preparation for making struct dsa_switch_ops const, encapsulate it
      within a dsa_switch_driver which has a list pointer and a pointer to
      dsa_switch_ops. This allows us to take the list_head pointer out of
      dsa_switch_ops, which is written to by {un,}register_switch_driver.
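      
      The encapsulation can be sketched as below. This is a simplified
      model, not the DSA code: struct switch_ops, struct switch_driver and
      the *_sketch functions are hypothetical names, and a singly linked
      list stands in for the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Read-only operations table; it carries no list linkage of its own. */
struct switch_ops {
	const char *(*probe)(void);
};

/* Wrapper holding the mutable list pointer next to the const ops. */
struct switch_driver {
	struct switch_driver *next;	/* written on register/unregister */
	const struct switch_ops *ops;	/* never written after setup */
};

static struct switch_driver *driver_list;

static void register_switch_driver_sketch(struct switch_driver *drv)
{
	assert(drv->ops != NULL);
	drv->next = driver_list;
	driver_list = drv;
}

static void unregister_switch_driver_sketch(struct switch_driver *drv)
{
	struct switch_driver **pp;

	for (pp = &driver_list; *pp; pp = &(*pp)->next) {
		if (*pp == drv) {
			*pp = drv->next;
			drv->next = NULL;
			return;
		}
	}
}

/* Example driver: the ops table can now live in read-only data. */
static const char *demo_probe(void)
{
	return "demo";
}

static const struct switch_ops demo_ops = { .probe = demo_probe };
static struct switch_driver demo_driver = { .ops = &demo_ops };
```

      Because registration only writes to the wrapper, the operations table
      itself can be declared const.
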
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab3d408d
    • F
      net: dsa: bcm_sf2: Declare our own dsa_switch_ops · 73095cb1
      Florian Fainelli committed
      Utilize the b53 exported functions to fill our bcm_sf2_ops structure,
      also making it clear what we utilize and what we specifically override.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73095cb1
    • F
      net: dsa: b53: Export most operations to other drivers · 3117455d
      Florian Fainelli committed
      In preparation for making dsa_switch_ops const, export b53 operations
      utilized by other drivers such as bcm_sf2.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3117455d
    • D
      Merge branch 'sh_eth-csum' · 1940075b
      David S. Miller committed
      Sergei Shtylyov says:
      
      ====================
      sh_eth: "intelligent checksum" related cleanups
      
      Here's a set of 2 patches against DaveM's 'net.git' repo, as they are based
      on a couple patches merged there recently; however, the patches are destined
      for 'net-next.git' (once 'net.git' gets merged there next time). I'm cleaning
      up the "intelligent checksum" related code (however, the driver only disables
      this feature for now, there's no proper offload support yet).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1940075b
    • S
      sh_eth: rename 'sh_eth_cpu_data::hw_crc' · 62e04b7e
      Sergei Shtylyov committed
      The 'struct sh_eth_cpu_data' field indicating the "intelligent checksum"
      support was misnamed 'hw_crc' -- rename it to 'hw_checksum'.
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62e04b7e
    • S
      sh_eth: get rid of 'sh_eth_cpu_data::shift_rd0' · 2e653ff0
      Sergei Shtylyov committed
      After checking all the available manuals, I have enough information to
      conclude that the 'shift_rd0' flag is only relevant for the Ether cores
      supporting so-called "intelligent checksum" (and hence having CSMR), which
      is indicated by the 'hw_crc' flag. Since all the relevant SoCs now have
      both these flags set, we can at last get rid of the former flag...
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e653ff0
    • D
      bb1d3034