1. 18 Oct, 2017 7 commits
  2. 17 Oct, 2017 15 commits
  3. 15 Oct, 2017 4 commits
    • tcp: add a tracepoint for tcp retransmission · e086101b
      Cong Wang committed
      We need a real-time notification for tcp retransmission
      for monitoring.
      
      Of course, we could use ftrace to dynamically instrument this
      kernel function too. However, we can't retrieve the connection
      information at the same time. For example, perf-tools [1] reads
      /proc/net/tcp for socket details, which is slow when there are
      a lot of connections.
      
      Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
      and exposes src/dst IP addresses and ports of the connection.
      This also makes it easier to integrate into perf.
      
      Note, I expose both IPv4 and IPv6 addresses at the same time:
      for an IPv4 socket, the v4-mapped address is used as the IPv6 address;
      for an IPv6 socket, LOOPBACK4_IPV6 is already filled in by the kernel.
      Also, add sk and skb pointers as they are useful for BPF.
      
      [1] https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Brendan Gregg <bgregg@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
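To illustrate the v4-mapped form mentioned in e086101b, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `to_v4_mapped` and the sample address are made up for the example; this is not the kernel code, only the address arithmetic.

```python
import ipaddress


def to_v4_mapped(addr: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4 address."""
    v4 = ipaddress.IPv4Address(addr)
    # Mapped layout: 80 zero bits, 16 one bits, then the 32-bit IPv4 address.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


# Hypothetical sample address, chosen only for illustration.
mapped = to_v4_mapped("192.168.20.1")
```

A consumer of the tracepoint that only looks at the IPv6 fields could recover the IPv4 address again with `mapped.ipv4_mapped`.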
    • net_sched: fix a compile warning in act_ife · 65787594
      Cong Wang committed
      Apparently ife_meta_id2name() is only called when
      CONFIG_MODULES is defined.
      
      This fixes:
      
      net/sched/act_ife.c:251:20: warning: ‘ife_meta_id2name’ defined but not used [-Wunused-function]
       static const char *ife_meta_id2name(u32 metaid)
                          ^~~~~~~~~~~~~~~~
      
      Fixes: d3f24ba8 ("net sched actions: fix module auto-loading")
      Cc: Roman Mashak <mrv@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: remove .set_addr · 841f4f24
      Vivien Didelot committed
      Now that there is no user for the .set_addr function, remove it from
      DSA. If a switch supports this feature (like mv88e6xxx), the
      implementation can be done in the driver setup.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • icmp: don't fail on fragment reassembly time exceeded · 258bbb1b
      Matteo Croce committed
      The ICMP implementation currently replies to an ICMP time exceeded message
      (type 11) with an ICMP host unreachable message (type 3, code 1).
      
      However, time exceeded messages can either represent "time to live exceeded
      in transit" (code 0) or "fragment reassembly time exceeded" (code 1).
      
      Unconditionally replying to "fragment reassembly time exceeded" with
      host unreachable messages can cause unjustified connection resets.
      These are now easily triggered because UFO has been removed, so
      sending large buffers triggers IP fragmentation.
      
      The issue can be easily reproduced by running a lot of UDP streams,
      which is likely to trigger IP fragmentation:
      
        # start netserver in the test namespace
        ip netns add test
        ip netns exec test netserver
      
        # create a VETH pair
        ip link add name veth0 type veth peer name veth0 netns test
        ip link set veth0 up
        ip -n test link set veth0 up
      
        for i in $(seq 20 29); do
            # assign addresses to both ends
            ip addr add dev veth0 192.168.$i.1/24
            ip -n test addr add dev veth0 192.168.$i.2/24
      
            # start the traffic
            netperf -L 192.168.$i.1 -H 192.168.$i.2 -t UDP_STREAM -l 0 &
        done
      
        # wait
        send_data: data send error: No route to host (errno 113)
        netperf: send_omni: send_data failed: No route to host
      
      We need to differentiate instead: if fragment reassembly time exceeded
      is reported, we need to silently drop the packet; if time to live
      exceeded is reported, we maintain the current behaviour.
      In both cases, increment the related error count "icmpInTimeExcds".
      
      While at it, fix a typo in a comment, and convert the if statement
      into a switch to make it more readable.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
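The decision described in 258bbb1b can be sketched as follows. This is a Python model of the rule, not the kernel patch: the constant names are borrowed from the Linux ICMP header as I recall them, while `on_time_exceeded`, `mib`, and the action strings are hypothetical names for the example.

```python
from collections import Counter

ICMP_EXC_TTL = 0        # time to live exceeded in transit
ICMP_EXC_FRAGTIME = 1   # fragment reassembly time exceeded

mib = Counter()         # stand-in for the kernel's SNMP counters


def on_time_exceeded(code: int) -> str:
    """Decide what to do with a received ICMP time exceeded (type 11) message."""
    mib["icmpInTimeExcds"] += 1      # counted in both cases
    if code == ICMP_EXC_FRAGTIME:
        return "drop"                # silently ignore, no error to the socket
    return "report"                  # unchanged: report an error as before


actions = [on_time_exceeded(ICMP_EXC_TTL), on_time_exceeded(ICMP_EXC_FRAGTIME)]
```

With this rule, the netperf streams in the reproduction above would presumably no longer fail with "No route to host" when a peer reports a reassembly timeout.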
  4. 14 Oct, 2017 1 commit
    • mqprio: Introduce new hardware offload mode and shaper in mqprio · 4e8b86c0
      Amritha Nambiar committed
      The offload types currently supported in mqprio are 0 (no offload) and
      1 (offload only TCs) by setting these values for the 'hw' option. If
      offloads are supported by setting the 'hw' option to 1, the default
      offload mode is 'dcb' where only the TC values are offloaded to the
      device. This patch introduces a new hardware offload mode called
      'channel' with 'hw' set to 1 in mqprio which makes full use of the
      mqprio options, the TCs, the queue configurations and the QoS parameters
      for the TCs. This is achieved through a new netlink attribute for the
      'mode' option which takes values such as 'dcb' (default) and 'channel'.
      The 'channel' mode also supports QoS attributes for traffic class such as
      minimum and maximum values for bandwidth rate limits.
      
      This patch enables configuring additional HW shaper attributes associated
      with a traffic class. Currently, only the shaper for bandwidth rate
      limiting is supported. It takes minimum and maximum bandwidth rates as
      options, which are offloaded to the hardware in the 'channel' mode. The
      min and max limits for bandwidth rates are provided by the user along
      with the TCs and the queue configurations when creating the mqprio qdisc.
      The interface can be extended to support new HW shapers in the future
      through the 'shaper' attribute.
      
      This patch also introduces a new data structure, 'tc_mqprio_qopt_offload',
      for offloading mqprio queue options; it is shared between the kernel and
      the device driver. It contains a copy of the existing data structure
      for mqprio queue options. This new data structure can be extended when
      adding new attributes for a traffic class, such as mode, shaper, and shaper
      parameters (bandwidth rate limits). The existing data structure for mqprio
      queue options will be shared between the kernel and userspace.
      
      Example:
        queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
        min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
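As a rough illustration of the per-TC min/max rates in the 4e8b86c0 example, the sketch below pairs the `min_rate` and `max_rate` lists by traffic class. Everything here is an assumption made for the example: the helper names, the decimal interpretation of the `Kbit`/`Mbit`/`Gbit` suffixes, and the check that a minimum may not exceed its maximum. None of it is taken from the patch itself.

```python
# Assumed decimal units (bits per second); not taken from the patch.
UNITS = {"Kbit": 10**3, "Mbit": 10**6, "Gbit": 10**9}


def parse_rate(text: str) -> int:
    """Convert a string such as '1Gbit' to bits per second."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    raise ValueError(f"unknown rate unit in {text!r}")


def shaper_limits(min_rates, max_rates):
    """Return [(tc, min_bps, max_bps)] with one entry per traffic class."""
    if len(min_rates) != len(max_rates):
        raise ValueError("need one min and one max rate per traffic class")
    limits = []
    for tc, (lo, hi) in enumerate(zip(min_rates, max_rates)):
        lo_bps, hi_bps = parse_rate(lo), parse_rate(hi)
        if lo_bps > hi_bps:   # hypothetical sanity check
            raise ValueError(f"TC {tc}: min_rate exceeds max_rate")
        limits.append((tc, lo_bps, hi_bps))
    return limits


# Values from the example in the commit message: two TCs.
limits = shaper_limits(["1Gbit", "2Gbit"], ["4Gbit", "5Gbit"])
```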
  5. 13 Oct, 2017 13 commits
    • tipc: add multipoint-to-point flow control · 04d7b574
      Jon Maloy committed
      We already have point-to-multipoint flow control within a group. But
      we also need the opposite: a scheme that can handle potentially
      hundreds of sources trying to send messages to the same destination
      simultaneously without causing buffer overflow at the recipient. This
      commit adds such a mechanism.
      
      The algorithm works as follows:
      
      - When a member detects a new, joining member, it initially sets its
        state to JOINED and advertises a minimum window to the new member.
        This window is chosen so that the new member can send exactly one
        maximum sized message, or several smaller ones, to the recipient
        before it must stop and wait for an additional advertisement. This
        minimum window ADV_IDLE is set to 65 1kB blocks.
      
      - When a member receives the first data message from a JOINED member,
        it changes the state of the latter to ACTIVE, and advertises a larger
        window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
        continue sending with minimal disturbances to the data flow.
      
      - The active members are kept in a dedicated linked list. Each time a
        message is received from an active member, it will be moved to the
        tail of that list. This way, we keep a record of which members have
        been most (tail) and least (head) recently active.
      
      - There is a maximum number (16) of permitted simultaneous active
        senders per receiver. When this limit is reached, the receiver will
        not advertise anything immediately to a new sender, but instead put
        it in a PENDING state, and add it to a corresponding queue. At the
        same time, it will pick the least recently active member, send it an
        advertisement RECLAIM message, and set this member to state
        RECLAIMING.
      
      - The reclaimee member has to respond with a REMIT message, meaning that
        it goes back to a send window of ADV_IDLE, and returns its unused
        advertised blocks beyond that value to the reclaiming member.
      
      - When the reclaiming member receives the REMIT message, it unlinks
        the reclaimee from its active list, resets its state to JOINED, and
        notes that it is now back at ADV_IDLE advertised blocks to that
        member. If there are still unread data messages sent out by
        reclaimee before the REMIT, the member goes into an intermediate
        state REMITTED, where it stays until the said messages have been
        consumed.
      
      - The returned advertised blocks can now be re-advertised to the
        pending member, which is now set to state ACTIVE and added to
        the active member list.
      
      - To be proactive, i.e., to minimize the risk that any member will
        end up in the pending queue, we start reclaiming resources already
        when the number of active members exceeds 3/4 of the permitted
        maximum.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
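The receiver-side algorithm listed in 04d7b574 can be sketched as a small state machine. This is a simplified model, not the kernel implementation: the class and method names are invented, the intermediate REMITTED state is skipped, and block accounting is reduced to logging the advertised window sizes. The numeric constants come from the commit message.

```python
from collections import OrderedDict, deque

ADV_IDLE = 65                      # minimum window, in 1 kB blocks
ADV_ACTIVE = 12 * ADV_IDLE         # window granted to an active sender
MAX_ACTIVE = 16                    # permitted simultaneous active senders
RECLAIM_AT = MAX_ACTIVE * 3 // 4   # start reclaiming above this count


class Receiver:
    def __init__(self):
        self.state = {}               # member -> state name
        self.active = OrderedDict()   # head = least recently active
        self.pending = deque()        # members waiting for a larger window
        self.sent = []                # (member, message, blocks) log

    def on_join(self, member):
        self.state[member] = "JOINED"
        self.sent.append((member, "ADV", ADV_IDLE))

    def on_data(self, member):
        state = self.state[member]
        if state == "ACTIVE":
            self.active.move_to_end(member)   # most recently active at tail
        elif state == "JOINED":
            if len(self.active) >= MAX_ACTIVE:
                self.state[member] = "PENDING"
                self.pending.append(member)
                self._reclaim()
            else:
                self._activate(member)
                if len(self.active) > RECLAIM_AT:
                    self._reclaim()           # proactive reclaim

    def on_remit(self, member):
        # Simplification: the intermediate REMITTED state is skipped.
        del self.active[member]
        self.state[member] = "JOINED"
        if self.pending:
            self._activate(self.pending.popleft())

    def _activate(self, member):
        self.state[member] = "ACTIVE"
        self.active[member] = True
        self.sent.append((member, "ADV", ADV_ACTIVE))

    def _reclaim(self):
        for member in self.active:            # least recently active first
            if self.state[member] == "ACTIVE":
                self.state[member] = "RECLAIMING"
                self.sent.append((member, "RECLAIM", 0))
                return


rx = Receiver()
for i in range(17):
    rx.on_join(i)
for i in range(17):
    rx.on_data(i)          # member 16 finds the active list full
state_before_remit = rx.state[16]
rx.on_remit(0)             # member 0 hands back its window
```

In this run the seventeenth sender is first parked as PENDING and becomes ACTIVE once the least recently active member has answered the RECLAIM with a REMIT.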
    • tipc: guarantee delivery of last broadcast before DOWN event · a3bada70
      Jon Maloy committed
      The following scenario is possible:
      - A user sends a broadcast message, and thereafter immediately leaves
        the group.
      - The LEAVE message, following a different path than the broadcast,
        arrives ahead of the broadcast, and the sending member is removed
        from the receiver's list.
      - The broadcast message arrives, but is dropped because the sender
        now is unknown to the recipient.
      
      We fix this by sequence numbering membership events, just like ordinary
      unicast messages. Currently, when a JOIN is sent to a peer, it contains
      a synchronization point (the sequence number of the next sent
      broadcast) in order to give the receiver a start synchronization point.
      We now let LEAVE messages also contain such an "end synchronization"
      point, so that the recipient can delay the removal of the sending member
      until it knows that all messages have been received.
      
      The received synchronization points are added as sequence numbers to the
      generated membership events, making it possible to handle them almost
      the same way as regular unicasts in the receiving filter function. In
      particular, a DOWN event with a too high sequence number will be kept
      in the reordering queue until the missing broadcast(s) arrive and have
      been delivered.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
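A minimal sketch of the "end synchronization" idea from a3bada70: the DOWN event is held back until every broadcast below the sequence number carried in the LEAVE has been delivered. The class and field names are hypothetical, and sequence number wrap-around is ignored for brevity.

```python
class MemberInput:
    """Receive state kept for one remote member (illustrative only)."""

    def __init__(self, next_seq):
        self.next_seq = next_seq   # next broadcast sequence number expected
        self.deferred = {}         # seqno -> broadcast waiting for delivery
        self.down_at = None        # end synchronization point from LEAVE
        self.delivered = []        # what the user sees, in order

    def on_broadcast(self, seqno, payload):
        self.deferred[seqno] = payload
        self._drain()

    def on_leave(self, sync_point):
        # sync_point: sequence number of the leaver's next (never sent) broadcast
        self.down_at = sync_point
        self._drain()

    def _drain(self):
        while self.next_seq in self.deferred:
            self.delivered.append(self.deferred.pop(self.next_seq))
            self.next_seq += 1
        if self.down_at is not None and self.next_seq >= self.down_at:
            self.delivered.append("DOWN")
            self.down_at = None


member = MemberInput(next_seq=5)
member.on_leave(sync_point=6)          # LEAVE overtakes broadcast 5
early = list(member.delivered)         # nothing delivered yet
member.on_broadcast(5, "last broadcast")
```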
    • tipc: guarantee delivery of UP event before first broadcast · 399574d4
      Jon Maloy committed
      The following scenario is possible:
      - A user joins a group, and immediately sends out a broadcast message
        to its members.
      - The broadcast message, following a different data path than the
        initial JOIN message sent out during the joining procedure, arrives
        at a receiver before the latter.
      - The receiver drops the message, since it is not ready to accept any
        messages until the JOIN has arrived.
      
      We avoid this by treating group protocol JOIN messages like unicast
      messages.
      - We let them pass through the recipient's multicast input queue, just
        like ordinary unicasts.
      - We force the first following broadcast to be sent as replicated
        unicast and to be acknowledged by the recipient before accepting
        any more broadcast transmissions.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee that group broadcast doesn't bypass group unicast · 2f487712
      Jon Maloy committed
      We need a mechanism guaranteeing that group unicasts sent out from a
      socket are not bypassed by later sent broadcasts from the same socket.
      We do this as follows:
      
      - Each time a unicast is sent, we set the broadcast method for the
        socket to "replicast" and "mandatory". This forces the first
        subsequent broadcast message to follow the same network and data path
        as the preceding unicast to a destination, hence preventing it from
        overtaking the latter.
      
      - In order to make the 'same data path' statement above true, we let
        group unicasts pass through the multicast link input queue, instead
        of as previously through the unicast link input queue.
      
      - In the first broadcast following a unicast, we set a new header flag,
        requiring all recipients to immediately acknowledge its reception.
      
      - During the period before all the expected acknowledgments are received,
        the socket refuses to accept any more broadcast attempts, i.e., by
        blocking or returning EAGAIN. This period should typically not be
        longer than a few microseconds.
      
      - When all acknowledgments have been received, the sending socket will
        open up for subsequent broadcasts, this time giving the link layer
        freedom to select the best transmission method itself.
      
      - The forced and/or abrupt transmission method changes described above
        may lead to broadcasts arriving out of order to the recipients. We
        remedy this by introducing code that checks and if necessary
        re-orders such messages at the receiving end.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
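The sender-side rules in 2f487712 can be modeled roughly as below. This is a sketch under stated assumptions: `GroupSender`, its methods, and the returned tuples are invented for the example, and the receive-side re-ordering mentioned in the last bullet is left out.

```python
import errno


class GroupSender:
    """Sending socket state for one group member (illustrative only)."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.force_replicast = False   # "replicast" and "mandatory"
        self.awaiting_acks = set()     # peers that still owe an acknowledgment

    def send_unicast(self, dest):
        self.force_replicast = True    # next broadcast must follow the same path
        return ("unicast", dest)

    def send_broadcast(self):
        if self.awaiting_acks:
            # A non-blocking socket would see EAGAIN here.
            raise BlockingIOError(errno.EAGAIN, "waiting for acknowledgments")
        if self.force_replicast:
            self.force_replicast = False
            self.awaiting_acks = set(self.peers)
            return ("replicast", True)      # (method, ack requested)
        return ("auto", False)              # link layer picks the method

    def on_ack(self, peer):
        self.awaiting_acks.discard(peer)


sender = GroupSender(["a", "b"])
sender.send_unicast("a")
first = sender.send_broadcast()      # forced replicast, acks requested
try:
    sender.send_broadcast()
    blocked = False
except BlockingIOError:
    blocked = True                   # still waiting for acknowledgments
sender.on_ack("a")
sender.on_ack("b")
later = sender.send_broadcast()      # back to free method selection
```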
    • tipc: guarantee group unicast doesn't bypass group broadcast · b87a5ea3
      Jon Maloy committed
      Group unicast messages don't follow the same path as broadcast messages,
      and there is a high risk that unicasts sent from a socket might bypass
      previously sent broadcasts from the same socket.
      
      We fix this by letting all unicast messages carry the sequence number of
      the next sent broadcast from the same node, but without updating this
      number at the receiver. This way, a receiver can check and if necessary
      re-order such messages before they are added to the socket receive buffer.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
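The check described in b87a5ea3 can be sketched like this: each unicast carries the sequence number of the sender's next broadcast, and the receiver holds the unicast back until all earlier broadcasts have arrived. Names are hypothetical, and broadcasts themselves are assumed to arrive in order to keep the example short.

```python
class GroupInput:
    """Per-sender receive state at one group member (illustrative only)."""

    def __init__(self):
        self.bc_expected = 0   # sequence number of the next broadcast expected
        self.held = []         # unicasts that overtook a broadcast
        self.delivered = []    # what reaches the socket receive buffer

    def on_broadcast(self, seqno, payload):
        # Simplification: broadcasts are assumed to arrive in order.
        self.delivered.append(payload)
        self.bc_expected = seqno + 1
        self._release()

    def on_unicast(self, bc_next, payload):
        # bc_next is only compared against bc_expected, never used to update it.
        self.held.append((bc_next, payload))
        self._release()

    def _release(self):
        still_held = []
        for bc_next, payload in self.held:
            if bc_next <= self.bc_expected:
                self.delivered.append(payload)
            else:
                still_held.append((bc_next, payload))
        self.held = still_held


rx = GroupInput()
rx.on_unicast(1, "unicast sent after broadcast 0")   # arrives too early
rx.on_broadcast(0, "broadcast 0")
```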
    • tipc: introduce group multicast messaging · 5b8dddb6
      Jon Maloy committed
      The previously introduced message transport to all group members is
      based on the tipc multicast service, but is logically a broadcast
      service within the group, and that is what we call it.
      
      We now add functionality for sending messages to all group members
      having a certain identity. Correspondingly, we call this feature 'group
      multicast'. The service is using unicast when only one destination is
      found, otherwise it will use the bearer broadcast service to transfer
      the messages. In the latter case, the receiving members filter arriving
      messages by looking at the intended destination instance. If there is
      no match, the message will be dropped, while still being considered
      received and read as seen by the flow control mechanism.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
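A small sketch of the two rules in 5b8dddb6: the transport choice on the sending side, and the instance filter on the receiving side, where a non-matching message is dropped but still counts as consumed for flow control. The names `pick_transport` and `Member` are made up for the example.

```python
def pick_transport(destinations):
    """Unicast for a single matching destination, bearer broadcast otherwise."""
    return "unicast" if len(destinations) == 1 else "broadcast"


class Member:
    def __init__(self, instance):
        self.instance = instance
        self.inbox = []      # messages actually handed to the user
        self.consumed = 0    # messages reported to flow control as received and read

    def on_multicast(self, dest_instance, payload):
        self.consumed += 1                     # counted even when filtered out
        if dest_instance == self.instance:
            self.inbox.append(payload)


members = [Member(1), Member(2), Member(2)]
for m in members:
    m.on_multicast(2, "hello")                 # addressed to instance 2
```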
    • tipc: introduce group anycast messaging · ee106d7f
      Jon Maloy committed
      In this commit, we make it possible to send connectionless unicast
      messages to any member corresponding to the given member identity,
      when there is more than one such member. The sender must use a
      TIPC_ADDR_NAME address to achieve this effect.
      
      We also perform load balancing between the destinations, i.e., we
      preferably select one that has advertised a sufficient send window
      to avoid a block/EAGAIN delay, if there is one. This mechanism is
      overlaid on the always-present round-robin selection.
      
      Anycast messages are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
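The selection rule in ee106d7f, round-robin with a preference for destinations that have enough advertised window, might look like this in outline. The function name, the tuple layout, and the block counts are assumptions for the example.

```python
def pick_anycast_dest(members, start, msg_blocks):
    """Return the index of the destination to use.

    members    -- list of (name, advertised_window_in_blocks)
    start      -- round-robin position to begin the search at
    msg_blocks -- size of the message to send, in blocks
    """
    n = len(members)
    order = [(start + i) % n for i in range(n)]
    for idx in order:
        if members[idx][1] >= msg_blocks:
            return idx            # first candidate that will not block
    return order[0]               # nobody has window: plain round-robin


# Hypothetical group with three members matching the same identity.
group = [("a", 0), ("b", 10), ("c", 10)]
first_pick = pick_anycast_dest(group, start=0, msg_blocks=5)
second_pick = pick_anycast_dest(group, start=2, msg_blocks=5)
```

When no member has a sufficient window, the fallback choice means the sender would block or get EAGAIN, as with any other flow-controlled send.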
    • tipc: introduce group unicast messaging · 27bd9ec0
      Jon Maloy committed
      We now make it possible to send connectionless unicast messages
      within a communication group. To send a message, the sender can use
      either a direct port address, aka port identity, or an indirect port
      name to be looked up.
      
      Messages of this type are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce flow control for group broadcast messages · b7d42635
      Jon Maloy committed
      We introduce an end-to-end flow control mechanism for group broadcast
      messages. This ensures that no messages are ever lost because of
      destination receive buffer overflow, with minimal impact on performance.
      For now, the algorithm is based on the assumption that there is only one
      active transmitter at any moment in time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: receive group membership events via member socket · ae236fb2
      Jon Maloy committed
      As with any other service, group members' availability can be
      subscribed for by connecting to the topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events may arrive out of sync with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member
      synchronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
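On the user side, the flag convention from ae236fb2 could be decoded as in the sketch below. The helper `classify` and its return strings are hypothetical; `MSG_OOB` and `MSG_EOR` are the standard socket flag constants (available in Python's `socket` module on Linux), and the flags value would normally come from the `msg_flags` result of `recvmsg()`.

```python
import socket


def classify(msg_flags: int) -> str:
    """Classify a message read from a group member socket by its flags."""
    if not msg_flags & socket.MSG_OOB:
        return "data"                    # ordinary group message
    if msg_flags & socket.MSG_EOR:
        return "member-down"             # membership event: member lost
    return "member-up"                   # membership event: member added


kinds = [
    classify(0),
    classify(socket.MSG_OOB),
    classify(socket.MSG_OOB | socket.MSG_EOR),
]
```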
    • tipc: add second source address to recvmsg()/recvfrom() · 31c82a2d
      Jon Maloy committed
      With group communication, it becomes important for a message receiver to
      identify not only from which socket (identified by a node:port tuple) the
      message was sent, but also the logical identity (type:instance) of the
      sending member.
      
      We fix this by adding a second instance of struct sockaddr_tipc to the
      source address area when a message is read. The extra address struct
      is filled in with data found in the received message header (type) and
      in the local member representation struct (instance).
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce communication groups · 75da2163
      Jon Maloy committed
      As a preparation for introducing flow control for multicast and datagram
      messaging, we need a more strictly defined framework than we have now. A
      socket must be able to keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as a logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if it wants to be eligible as destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be member of one group at a time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
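To make the four parameters of the join request in 75da2163 concrete, the sketch below packs them into a byte buffer. The layout (four consecutive unsigned 32-bit fields in native byte order) and all numeric values are assumptions for illustration; the authoritative definition is in the kernel's TIPC header, which is not shown in the commit message.

```python
import struct

# Assumed layout: type, instance, scope, flags as four unsigned 32-bit fields.
GROUP_REQ_FORMAT = "=4I"


def pack_group_req(group_type, instance, scope, flags):
    """Pack a join request as it might be handed to setsockopt()."""
    return struct.pack(GROUP_REQ_FORMAT, group_type, instance, scope, flags)


# Hypothetical values: group 4711, member instance 17, placeholder scope
# and flag numbers that do not correspond to real TIPC constants.
req = pack_group_req(4711, 17, 2, 1)
fields = struct.unpack(GROUP_REQ_FORMAT, req)
```

On a system with TIPC available, such a buffer would be passed as the option value of a `setsockopt()` call using the TIPC_GROUP_JOIN option named in the commit; that call is not made here so the example runs anywhere.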
    • tipc: improve destination linked list · a80ae530
      Jon Maloy committed
      We often see a need for a linked list of destination identities,
      sometimes containing a port number, sometimes a node identity, and
      sometimes both. The currently defined struct u32_list is not generic
      enough to cover all cases, so we extend it to contain two u32 integers
      and rename it to struct tipc_dest_list.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>