A detail about TCP_CORK

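It's the second day of the National Day holiday, a perfect time to study congestion control. I can't yet afford a trip to Africa to watch wildebeest and zebras migrate under the hungry gaze of lions and crocodiles, but I can watch something even more spectacular right outside my door… I haven't written anything about TCP for a long time. Watching the National Day traffic jams with a beer in hand simply reminded me of TCP. No holiday, no TCP, so let's get into it…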


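Many people know TCP's Nagle algorithm, but far fewer know TCP_CORK. In one sentence, TCP_CORK can be thought of as a stronger Nagle. Whereas Nagle implicitly holds back small packets, TCP_CORK explicitly blocks them, as the name suggests: as long as user space has not explicitly pulled the cork, the leftover data smaller than one MSS will never be sent!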
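As a minimal sketch of what putting the cork in and pulling it out looks like from user space on Linux, assuming an already created TCP socket. The helper names set_cork and get_cork are mine and do not come from any library:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: on != 0 puts the cork in, on == 0 pulls it out.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_cork(int fd, int on)
{
    int val = on ? 1 : 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &val, sizeof(val));
}

/* Hypothetical helper: returns 1 if the socket is corked, 0 if not,
 * -1 on error. */
int get_cork(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &val, &len) < 0)
        return -1;
    return val ? 1 : 0;
}
```

A typical caller would do set_cork(fd, 1), issue several write() calls, then set_cork(fd, 0) to flush whatever partial segment is left.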

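That is only the designer's intent. Is it really so in practice? What if the programmer forgets to pull the cork? How is that mistake tolerated? (A programming error should not cause bizarre behavior at the protocol level; TCP is supposed to be robust.)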


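Many articles say that the TCP stack waits 200ms and, if nobody has pulled the cork by then, unconditionally sends the leftover data even if it is smaller than one MSS. That certainly makes the system more robust, but where does the 200ms come from? Why 200ms?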

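In fact, the Linux TCP_CORK implementation has no notion of 200ms at all. The so-called 200ms is merely the minimum RTO of a TCP connection, and the timeout after which corked data is sent is exactly one RTO, not 200ms!

How can we confirm this?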


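I admit I don't like the perf stack trace routine. To my mind, bringing in perf for something simple is not convenient at all and costs a lot of time. I'm not encouraging anyone to reinvent the wheel; I just dislike that ornate rococo style on principle. I like doing things by hand, short and quick! So I chose to write my own jprobe, following the basic tcp_probe pattern, to trace the stack.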


To confirm that the TCP_CORK timeout really is the RTO, I wrote the following packetdrill script:

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000, sackOK, nop, nop, nop, wscale 7>

+0 > S. 0:0(0) ack 1 <...>

+.1 < . 1:1(0) ack 1 win 257

+0 accept(3, ..., ...) = 4
+0 setsockopt(4, IPPROTO_TCP, TCP_NODELAY, [0], 4) = 0
// Set TCP_CORK: put the cork in
+0 setsockopt(4, IPPROTO_TCP, TCP_CORK, [1], 4) = 0

// Start sending full-MSS data
+0 write(4, ..., 1000) = 1000
// Delay the ACKs so that the RTO grows
+0.350 < . 1:1(0) ack 1001 win 10000
+0 write(4, ..., 1000) = 1000
+0.350 < . 1:1(0) ack 2001 win 10000

// Print the RTO and RTT values
+0 %{ print tcpi_rto }%
+0 %{ print tcpi_rtt }%

// This is the key step: send a small 10-byte packet; because of CORK it is bound to be delayed
+0 write(4, ..., 10) = 10
+0.40 < . 1:1(0) ack 2011 win 10000

// Wait and observe
+2.960 write(4, ..., 10000) = 10000

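Run the script and capture the packets, then compare the send timestamp of the final 10-byte packet with the timestamp before it: the difference is exactly the RTO printed by the packetdrill script. I'm about to leave for Shantou, so I won't post a screenshot here.

With an RTO of roughly 560ms, you will see that the CORK timeout is far longer than 200ms. It is not a fixed 200ms!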

There are no secrets in the code.
Let's look at TCP's timers:

#define ICSK_TIME_RETRANS 1 /* Retransmit timer */
#define ICSK_TIME_DACK 2 /* Delayed ack timer */
#define ICSK_TIME_PROBE0 3 /* Zero window probe timer */
#define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */
#define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */

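So which one is the CORK timer? While discussing this with a colleague, I had a vague feeling that I had run into this question before. I had indeed, so I searched my blog posts:
UDP_CORK, TCP_CORK and TCP_NODELAY
That was back in 2010. Who remembers technical details from that long ago? Luckily I wrote some of it down at the time…

That post notes that ICSK_TIME_PROBE0 is the timer that eventually sends data held back by TCP_CORK, and that tcp_write_wakeup is the function that does the actual sending. To find out this timer's timeout, I wrote the following probe code: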

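/* jprobe handler for sk_reset_timer(); its signature must match the probed
 * function. Note that jprobes only exist in kernels older than 4.15. */
void jsk_reset_timer(struct sock *sk, struct timer_list *timer,
                     unsigned long expires)
{
        struct inet_sock *inet = inet_sk(sk);

        /* port is the TCP port under test, defined elsewhere in the module */
        if (ntohs(inet->inet_dport) == port || ntohs(inet->inet_sport) == port) {
                struct inet_connection_sock *icsk = inet_csk(sk);
                /* only look at the timer shared by retransmit and probe0 events */
                if (&icsk->icsk_retransmit_timer == timer) {
                        /* tcp_probe0_when2() and tcp_probe0_base2() are not shown in this post;
                         * presumably local copies of the kernel's tcp_probe0_when()/tcp_probe0_base() */
                        printk("#####:%d %u %lu %lu\n", icsk->icsk_pending, jiffies_to_msecs(tcp_probe0_when2(sk, (unsigned)(120*HZ))), (unsigned long)tcp_probe0_base2(sk), icsk->icsk_timeout);
                        printk("#####:%u %u %u %d\n", jiffies_to_msecs(TCP_RTO_MIN), TCP_RTO_MIN, TCP_RTO_MAX, HZ);
                        if (icsk->icsk_pending == ICSK_TIME_PROBE0 /* i.e. the value 3 */)
                                dump_stack();
                }
        }
        jprobe_return();
}

static struct jprobe tcp_jprobe = {
        .kp = {
                .symbol_name    = "sk_reset_timer",
        },
        .entry  = jsk_reset_timer,
};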

The stack shows that the timer is set in __tcp_push_pending_frames:

void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
                   int nonagle)
{
    /* If we are closed, the bytes will have to remain here.
     * In time closedown will finish, we empty the write queue and
     * all will be happy.
     */
    if (unlikely(sk->sk_state == TCP_CLOSE))
        return;
    // tcp_write_xmit returns true here because TCP_CORK held the data back; see tcp_nagle_test -> tcp_nagle_check
    if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
               sk_gfp_atomic(sk, GFP_ATOMIC)))
        // so the probe timer is armed
        tcp_check_probe_timer(sk);
}
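To make the reference to tcp_nagle_check concrete, here is a user-space model of that decision as I understand it from kernels of this era. It is a simplified sketch and not the kernel source: struct conn_model and its minshall_pending flag (standing in for the kernel's tcp_minshall_check) are stand-ins of mine, while the two flag values mirror the kernel's TCP_NAGLE_OFF and TCP_NAGLE_CORK.

```c
#include <stdbool.h>

#define NAGLE_OFF  1  /* mirrors TCP_NAGLE_OFF: Nagle disabled (TCP_NODELAY) */
#define NAGLE_CORK 2  /* mirrors TCP_NAGLE_CORK: the socket is corked */

/* Hypothetical stand-in for the few connection fields the check needs. */
struct conn_model {
    unsigned int packets_out;  /* segments sent but not yet ACKed */
    bool minshall_pending;     /* a small segment is still unACKed */
};

/* Returns true when a segment must be held back.
 * partial: the segment is smaller than one MSS. */
bool nagle_holds_back(bool partial, const struct conn_model *c, int nonagle)
{
    if (!partial)
        return false;  /* full-sized segments always go out */
    if (nonagle & NAGLE_CORK)
        return true;   /* corked: hold even with nothing in flight */
    /* plain Nagle: hold only while earlier small data is unACKed */
    return !nonagle && c->packets_out > 0 && c->minshall_pending;
}
```

The difference that matters here is the corked case: a partial segment is held even when packets_out is 0, which is exactly the state in which tcp_check_probe_timer arms its timer.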

Finally, the logic that arms this probe timer is also very simple:

static inline void tcp_check_probe_timer(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    const struct inet_connection_sock *icsk = inet_csk(sk);

    // This condition matches the Nagle/Cork semantics exactly
    if (!tp->packets_out && !icsk->icsk_pending)
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                      // Note: the timeout is the RTO
                      icsk->icsk_rto, TCP_RTO_MAX);
}

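To be honest, using the name PROBE0 for Cork's timeout send is a bit of a misnomer, but we got used to this style long ago…


Finally, let's look at a subtler detail, concerning the minimum RTO.

As we know, the RTO is computed from the RTT, and the RTT here is really an exponentially weighted moving average of the sampled RTTs, which means that past RTT values keep a share in the smoothed value. So even on an ultra-fast path such as localhost to localhost, the RTT at the start is not the actual value but a preset empirical value. For the RTT to converge to the real value, the moving average has to run for a while, which means sending more data: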

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000, sackOK, nop, nop, nop, wscale 7>

+0 > S. 0:0(0) ack 1 <...>

+.1 < . 1:1(0) ack 1 win 257

+0 accept(3, ..., ...) = 4
+0 setsockopt(4, IPPROTO_TCP, TCP_NODELAY, [0], 4) = 0
+0 setsockopt(4, IPPROTO_TCP, TCP_CORK, [1], 4) = 0

+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 1001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 2001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 3001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 4001 win 10000

+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 5001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 6001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 7001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 8001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 9001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 10001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 11001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 12001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 13001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 14001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 15001 win 10000

+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 16001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 17001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 18001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 19001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 20001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 21001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 22001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 23001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 24001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 25001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 26001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 27001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 28001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 29001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 30001 win 10000

+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 31001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 32001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 33001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 34001 win 10000
+0 write(4, ..., 1000) = 1000
+0.0 < . 1:1(0) ack 35001 win 10000
+0 %{ print tcpi_rto }%
+0 %{ print tcpi_rtt }%
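// All the data sent above is only there to let the RTT settle! During the handshake the RTT is a guess; the more data is exchanged, the more accurate the RTT becomes, and so the more reasonable the RTO.

// This is the key step: send a small 10-byte packet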
+0 write(4, ..., 10) = 10
+0.40 < . 1:1(0) ack 35011 win 10000

+2.960 write(4, ..., 10000) = 10000
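The smoothing that the long run of writes in this script is meant to settle can be sketched with the textbook estimator from RFC 6298. This is an illustration under stated assumptions, not the kernel's code: Linux uses a scaled integer variant with its own variance handling, the clock granularity term is left out, and the names rtt_est, rtt_update and rto_ms are hypothetical.

```c
/* State of an RFC 6298 style estimator; all times in milliseconds. */
struct rtt_est {
    double srtt;     /* smoothed RTT */
    double rttvar;   /* RTT variation */
    int has_sample;  /* 0 until the first measurement arrives */
};

/* Feed one RTT measurement into the estimator. */
void rtt_update(struct rtt_est *e, double sample_ms)
{
    double err;

    if (!e->has_sample) {
        /* first measurement: SRTT = R, RTTVAR = R/2 */
        e->srtt = sample_ms;
        e->rttvar = sample_ms / 2.0;
        e->has_sample = 1;
        return;
    }
    err = e->srtt - sample_ms;
    if (err < 0)
        err = -err;
    /* RTTVAR is updated first, using the old SRTT (beta = 1/4) */
    e->rttvar = 0.75 * e->rttvar + 0.25 * err;
    /* then SRTT moves 1/8 of the way toward the sample (alpha = 1/8) */
    e->srtt = 0.875 * e->srtt + 0.125 * sample_ms;
}

/* RTO = SRTT + 4 * RTTVAR, clamped to a lower bound. */
double rto_ms(const struct rtt_est *e, double rto_min_ms)
{
    double rto = e->srtt + 4.0 * e->rttvar;
    return rto < rto_min_ms ? rto_min_ms : rto;
}
```

Because each new sample only contributes one eighth to the smoothed value, an initial guess takes many round trips to wash out, which is why the script acknowledges so many segments before sending the small one.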

At this point we reuse the probe program, the same jsk_reset_timer handler listed earlier with no changes, and read the minimum RTO from its second printk, which prints TCP_RTO_MIN both in milliseconds and in jiffies.

It turns out to be the smallest value greater than 200ms, where the 200ms is defined by the following macro:

#define TCP_RTO_MIN ((unsigned)(HZ/5))

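It depends on HZ. Note that TCP_RTO_MIN is expressed not in milliseconds but in clock ticks (jiffies); converting it to milliseconds takes the following function: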

unsigned int jiffies_to_msecs(const unsigned long j)
{
#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
    return (MSEC_PER_SEC / HZ) * j;
#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
    return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
#else
# if BITS_PER_LONG == 32
    return (HZ_TO_MSEC_MUL32 * j) >> HZ_TO_MSEC_SHR32;
# else
    return (j * HZ_TO_MSEC_NUM) / HZ_TO_MSEC_DEN;
# endif
#endif
}
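For a HZ that divides 1000 evenly, such as 100, 250 or 1000, only the first branch of the function above applies, and the arithmetic can be checked in user space. A minimal sketch, with hypothetical helper names and HZ passed as a parameter instead of being a compile-time constant:

```c
/* Models only the first branch of jiffies_to_msecs, i.e. HZ <= 1000
 * and 1000 % HZ == 0. Returns 0 for HZ values outside that case. */
unsigned int jiffies_to_msecs_simple(unsigned long j, unsigned int hz)
{
    if (hz == 0 || hz > 1000 || 1000 % hz != 0)
        return 0;
    return (1000 / hz) * (unsigned int)j;
}

/* TCP_RTO_MIN in jiffies for a given HZ, following the macro (HZ/5). */
unsigned int rto_min_jiffies(unsigned int hz)
{
    return hz / 5;
}
```

With HZ=250, rto_min_jiffies gives 50 jiffies, which converts to 200ms, and a single jiffy converts to 4ms.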

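jiffies_to_msecs looks messy, but with HZ=250 only its first branch applies, and TCP_RTO_MIN (50 jiffies) converts to exactly 200ms. The value actually observed is (200 + one tick in milliseconds) ms, which is 204ms for HZ=250; the extra tick presumably comes from the smoothed RTT term that is added on top of the 200ms floor when the RTO is computed. Why 200ms rather than 400ms or 20ms? I suspect it is another empirical value from back in the day, derived from network measurements of the time, much like the MSL. There is an interesting comment: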

/* Something is really bad, we could not queue an additional packet,
 * because qdisc is full or receiver sent a 0 window.
 * We do not want to add fuel to the fire, or abort too early,
 * so make sure the timer we arm now is at least 200ms in the future,
 * regardless of current icsk_rto value (as it could be ~2ms)
 */
static inline unsigned long tcp_probe0_base(const struct sock *sk)
{
    return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
}
Why delay by at least 200ms? Because you must not add fuel to the fire.
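The clamp in tcp_probe0_base is small enough to model directly. A minimal sketch, working in milliseconds for readability where the kernel works in jiffies; the function name probe0_base_ms is hypothetical:

```c
/* User-space model of the max_t() clamp in tcp_probe0_base: never arm
 * the timer for less than the minimum RTO. */
unsigned long probe0_base_ms(unsigned long cur_rto_ms,
                             unsigned long rto_min_ms)
{
    return cur_rto_ms > rto_min_ms ? cur_rto_ms : rto_min_ms;
}
```

Applied to the numbers in this post, a 2ms RTO is raised to 200ms, while the roughly 560ms RTO from the first experiment passes through unchanged.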

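Very meaningful and very profound: don't add fuel to the fire. I hope the drivers on the road over the National Day holiday can take this comment to heart; every holiday is a perfect chance for me to study congestion control…

Interesting, interesting.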
