    bpf: add support for BPF_SOCK_OPS_BASE_RTT · e6546ef6
    Lawrence Brakmo committed
    A congestion control algorithm can make a call to the BPF socket_ops
    program to request the base RTT. The base RTT can be congestion control
    dependent and is meant to represent a congestion threshold such that
    RTTs above it indicate congestion. This is especially useful for flows
    within a DC where the base RTT is easy to obtain.
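
The request/reply flow described above can be sketched as a small user-space model. This is not kernel or BPF code: the `model_*` names, the enum values, the struct layout, and the convention that a non-positive reply means "no base RTT supplied" are all assumptions made for illustration. Only the name BPF_SOCK_OPS_BASE_RTT and the idea of a program replying with a base RTT come from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the sock_ops op codes and context.
 * Values and layout are illustrative, not the kernel ABI. */
enum model_sock_op {
    MODEL_SOCK_OPS_VOID = 0,
    MODEL_SOCK_OPS_BASE_RTT = 1,
};

struct model_sock_ops {
    enum model_sock_op op; /* what the caller is asking for */
    int32_t reply;         /* value handed back to the caller */
};

/* Assumed policy value, in microseconds (80us is the figure quoted
 * later in this message for a rack with a 20us hardware RTT). */
#define MODEL_BASE_RTT_US 80

/* Models the socket_ops program: answer base RTT requests,
 * reply -1 to every op it does not handle. */
void model_sockops_prog(struct model_sock_ops *skops)
{
    assert(skops != NULL);
    switch (skops->op) {
    case MODEL_SOCK_OPS_BASE_RTT:
        skops->reply = MODEL_BASE_RTT_US;
        break;
    default:
        skops->reply = -1;
        break;
    }
}

/* Models the congestion control side: issue the request and fall
 * back to 0 ("measure it yourself") when no usable reply arrives. */
int32_t model_cc_query(enum model_sock_op op)
{
    struct model_sock_ops skops = { .op = op, .reply = 0 };

    model_sockops_prog(&skops);
    return skops.reply > 0 ? skops.reply : 0;
}
```

In this model, `model_cc_query(MODEL_SOCK_OPS_BASE_RTT)` yields 80, while an op the program does not handle yields 0, so the caller keeps its own RTT estimate.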
    
    Being provided a base RTT solves a basic problem in RTT based congestion
    avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
    to get the base RTT when the network is not congested, it is very
    difficult to do when it is very congested. Newer connections get an
    inflated value of the base RTT leading to unfairness (newer flows with a
    larger base RTT get more bandwidth). As a result, RTT based congestion
    avoidance algorithms tend to update their base RTTs to improve fairness.
    In very congested networks this can lead to base RTT inflation, reducing
    the ability of these RTT based congestion control algorithms to prevent
    congestion.
    
    Note that in my experiments with TCP-NV, the base RTT provided can be
    much larger than the actual hardware RTT. For example, experimenting
    with hosts within a rack where the hardware RTT is 16-20us, I've used
    base RTTs up to 150us. The effect of using a larger base RTT is that the
    congestion avoidance algorithm will allow more queueing. When there are
    only a few flows the main effect is larger measured RTTs and RPC
    latencies due to the increased queueing. When there are a lot of flows,
    a larger base RTT can lead to more congestion and more packet drops.
    For this case, where the hardware RTT is 20us, a base RTT of 80us
    produces good results.
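
To make "allow more queueing" concrete, here is a rough back-of-the-envelope sketch. It assumes a simple model in which the queueing delay tolerated before congestion is signalled equals the base RTT minus the hardware RTT, and converts that delay into bytes at a given link rate. The function name and the 10 Gb/s link rate used below are assumptions for illustration; neither appears in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of standing queue that fit in the gap between the configured
 * base RTT and the hardware RTT, for a link of link_mbps megabits/s.
 * 1 Mb/s sustained for 1 us carries exactly 1 bit, so
 * bits = delay_us * link_mbps. */
uint64_t queue_headroom_bytes(uint32_t base_rtt_us, uint32_t hw_rtt_us,
                              uint32_t link_mbps)
{
    assert(link_mbps > 0); /* a zero link rate is not meaningful here */

    if (base_rtt_us <= hw_rtt_us)
        return 0; /* no headroom: any queueing exceeds the threshold */

    uint64_t bits = (uint64_t)(base_rtt_us - hw_rtt_us) * link_mbps;
    return bits / 8;
}
```

Under these assumptions, with a 20us hardware RTT on a 10 Gb/s link, a base RTT of 80us corresponds to 75,000 bytes of tolerated queue, and a base RTT of 150us to 162,500 bytes.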
    
    This patch only introduces BPF_SOCK_OPS_BASE_RTT; a later patch in this
    set adds support for using it in TCP-NV. Further study and testing are
    needed before support can be added to other delay based congestion
    avoidance algorithms.
    
    Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
    Acked-by: Alexei Starovoitov <ast@fb.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>