• Y
    TencentOS-kernel: ipvs: avoid drop first packet by reusing conntrack · 173a8024
    Committed by YangYuxi
    fix #29256237
    
    commit a01a9445c00eca3e37523eb6b0d87f494eceeb4b TencentOS-kernel
    
    Since commit f719e375 ("ipvs: drop first packet to
    redirect conntrack"), when a new TCP connection meets
    the conditions that require rescheduling, the first SYN
    packet is dropped. This causes one second of latency for
    the new connection. More discussion of this problem is
    easy to find online, for example:
    
    1)One second connection delay in masque
    https://marc.info/?t=151683118100004&r=1&w=2
    
    2)IPVS low throughput #70747
    https://github.com/kubernetes/kubernetes/issues/70747
    
    3)Apache Bench can fill up ipvs service proxy in seconds #544
    https://github.com/cloudnativelabs/kube-router/issues/544
    
    4)Additional 1s latency in `host -> service IP -> pod`
    https://github.com/kubernetes/kubernetes/issues/90854
    
    5)kube-proxy ipvs conn_reuse_mode setting causes errors
    with high load from single client
    https://github.com/kubernetes/kubernetes/issues/81775
    
    The root cause is that when the old session expires, the
    conntrack related to the session is dropped by
    ip_vs_conn_drop_conntrack. The code is as follows:
    ```
    static void ip_vs_conn_expire(struct timer_list *t)
    {
    ...
    
         if ((cp->flags & IP_VS_CONN_F_NFCT) &&
             !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
                 /* Do not access conntracks during subsys cleanup
                  * because nf_conntrack_find_get can not be used after
                  * conntrack cleanup for the net.
                  */
                 smp_rmb();
                 if (ipvs->enable)
                         ip_vs_conn_drop_conntrack(cp);
         }
    ...
    }
    ```
    As the code shows, ip_vs_conn_drop_conntrack is called only when
    the condition (cp->flags & IP_VS_CONN_F_NFCT) is true.
    
    So we optimize this with the following steps (administrators
    can enable this optimization by setting
    net.ipv4.vs.conn_reuse_old_conntrack=1):
    1) clear the IP_VS_CONN_F_NFCT flag (this is safe because
       no packets will use the old session)
    2) call ip_vs_conn_expire_now to release the old session,
       so that the related conntrack is not dropped
    3) ipvs then no longer needs to drop the first SYN packet; it
       simply passes the SYN packet on to the next stage of
       processing and creates a new ipvs session. The new session
       is associated with the old conntrack (which conntrack
       reopens as a new one), and everything afterwards proceeds
       as if the old session had never existed.
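
    The three steps can be illustrated with a small userspace model.
    This is only a sketch and not the patch itself: struct model_conn,
    MODEL_CONN_F_NFCT, model_conn_expire and model_reschedule are
    hypothetical names invented for illustration, and the flag value
    is not the kernel's.
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical flag bit for this model; the kernel defines the real
     * IP_VS_CONN_F_NFCT value in its own headers.
     */
    #define MODEL_CONN_F_NFCT 0x1u

    /* Toy stand-in for an ipvs session and the conntrack entry tied to it. */
    struct model_conn {
        unsigned int flags;
        bool expired;            /* session has been released */
        bool conntrack_dropped;  /* conntrack entry was dropped on expiry */
    };

    enum model_verdict { MODEL_DROP_SYN, MODEL_PASS_SYN };

    /* Models the expiry path: the conntrack entry is dropped only while
     * the NFCT flag is still set on the session.
     */
    static void model_conn_expire(struct model_conn *cp)
    {
        assert(cp != NULL);
        if (cp->flags & MODEL_CONN_F_NFCT)
            cp->conntrack_dropped = true;
        cp->expired = true;
    }

    /* Models rescheduling of an old session that uses conntrack. */
    static enum model_verdict model_reschedule(struct model_conn *cp,
                                               bool reuse_old_conntrack)
    {
        assert(cp != NULL);
        if (!reuse_old_conntrack) {
            /* Previous behaviour: the conntrack is dropped, and so is the SYN. */
            model_conn_expire(cp);
            return MODEL_DROP_SYN;
        }
        cp->flags &= ~MODEL_CONN_F_NFCT;  /* step 1: clear the flag */
        model_conn_expire(cp);            /* step 2: conntrack survives expiry */
        return MODEL_PASS_SYN;            /* step 3: SYN continues onward */
    }
    ```
    In this model, calling model_reschedule(&conn, true) on a session
    that has MODEL_CONN_F_NFCT set leaves conntrack_dropped false and
    lets the SYN pass, whereas model_reschedule(&conn, false) drops both.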
    
    The above processing has no problems except for passive FTP.
    For the passive FTP case, ipvs can detect it from the
    conditions (atomic_read(&cp->n_control)) and (cp->control).
    So, for all other cases (that is, not FTP), ipvs should give users
    the right to choose: they can select the higher-performance processing
    logic by setting net.ipv4.vs.conn_reuse_old_conntrack=1. This is necessary
    because most business scenarios (such as Kubernetes) are very sensitive
    to the latency of short-lived TCP connections.
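
    The decision described above can be sketched as a single predicate.
    This is a userspace illustration under stated assumptions, not the
    patch code: struct model_ctl_conn and model_may_reuse_conntrack are
    hypothetical names, and a plain int stands in for the kernel's
    atomic counter.
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Toy stand-in for the two fields the commit message mentions. */
    struct model_ctl_conn {
        int n_control;                   /* connections controlled by this one */
        struct model_ctl_conn *control;  /* controlling connection, or NULL */
    };

    /* Returns true when the old conntrack may be reused: the administrator
     * has opted in and the session is neither a controlling nor a
     * controlled connection (the passive FTP case).
     */
    static bool model_may_reuse_conntrack(const struct model_ctl_conn *cp,
                                          bool sysctl_reuse_old_conntrack)
    {
        assert(cp != NULL);
        bool ftp_related = cp->n_control != 0 || cp->control != NULL;
        return sysctl_reuse_old_conntrack && !ftp_related;
    }
    ```
    With the knob off, the predicate is always false, which matches the
    note below that the knob is disabled by default.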
    
    This patch has been verified on thousands of our Kubernetes
    node servers at Tencent Inc.
    Signed-off-by: YangYuxi <yx.atom1@gmail.com>
    [Tony: add the missing sysctl knob and disable it by default]
    Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
    Acked-by: Dust Li <dust.li@linux.alibaba.com>