Learn TCP/IP with Me

723937936@qq.com · #46

Congestion Avoidance Algorithm

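TCP infers congestion from a single assumption: a lost packet means the network is congested
Congestion avoidance: the measures TCP takes when it detects packet loss

Two signals indicate congestion:
1. timeout
2. receipt of duplicate ACKs

Two congestion-avoidance measures:
Measure 1: cut the transmission rate immediately, then ramp it back up quickly; once it reaches a threshold (ssthresh), switch to Measure 2
Measure 2: increase the transmission rate slowly

Which measure runs depends on how severe the congestion is: Measure 1 for severe congestion, Measure 2 for mild congestion

Severity is judged from the signal that revealed the congestion:
1. timeout is treated as severe congestion
2. receipt of duplicate ACKs is treated as mild congestion (the receiver emits duplicate ACKs only because later segments are still arriving, so the path cannot be that congested)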

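Measure 1

When congestion is severe, TCP applies Measure 1

How is the transmission rate cut immediately?
By resetting cwnd to 1 segment

How is the transmission rate ramped up quickly?
Through the slow start algorithm, which makes cwnd grow exponentially

How large is the threshold (ssthresh)?
Whenever congestion occurs (severe or not), ssthresh is set to half of the current window size, where the current window size is min(cwnd, advertised_wnd) and cwnd is the old value from before the reset to 1
My feeling is that in practice, if the peer's advertised_wnd is small, congestion is unlikely in the first place; congestion is more likely caused by a large cwnd
So ssthresh can be thought of as half of cwnd

Once slow start has run long enough that cwnd > ssthresh, TCP switches to Measure 2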

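Measure 2

Recall that under slow start, cwnd grows by one segment for every ACK received
Suppose cwnd is currently 8 segments: TCP quickly injects 8 segments into the network, so after one RTT it may receive up to 8 ACKs, and cwnd has grown to 16 by the end of that RTT

Measure 2 aims to grow cwnd slowly, by just 1 segment per RTT (in this example, 1/8 of what slow start would add)
Since cwnd grows by 1 segment per RTT, its growth is linear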
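
A rough way to see the two growth patterns side by side is to simulate cwnd one RTT at a time. The sketch below is my own simplification, not how any real stack is coded: the names simulate_cwnd and on_timeout are made up for illustration, cwnd is counted in segments, every segment is assumed to be ACKed within one RTT, cwnd == ssthresh is treated as already being in Measure 2, and the floor of 2 segments on ssthresh is an assumption.

```python
def on_timeout(cwnd, advertised_wnd):
    """Severe congestion (timeout): return the new (cwnd, ssthresh) in segments."""
    # ssthresh is half of the current window; the floor of 2 is an assumption.
    ssthresh = max(min(cwnd, advertised_wnd) // 2, 2)
    return 1, ssthresh


def simulate_cwnd(rtts, ssthresh, cwnd=1):
    """Return cwnd (in segments) at the start of each RTT, assuming no loss."""
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd < ssthresh:
            # Slow start: +1 segment per ACK, so cwnd doubles every RTT
            # (capped at ssthresh in this coarse per-RTT model).
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: +1 segment per RTT.
            cwnd += 1
    return history


cwnd, ssthresh = on_timeout(cwnd=32, advertised_wnd=64)
print(cwnd, ssthresh)
print(simulate_cwnd(rtts=8, ssthresh=ssthresh))
```

After a timeout at cwnd=32 the threshold becomes 16, so the simulated cwnd doubles until it reaches 16 and then grows by 1 per RTT.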

723937936@qq.com · #47

Fast Retransmit and Fast Recovery Algorithms

The modes of TCP flow control:
1. slow start mode
2. fast recovery mode
3. congestion avoidance mode
4. maximum throughput mode

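This breakdown into modes is my own understanding and is for reference only

The congestion-avoidance measures in the previous post included the slow start algorithm, which muddled the concepts. To be clear: the congestion avoidance algorithm simply means TCP operating in congestion avoidance mode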

Fast Retransmit Algorithm

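Each time TCP receives an out-of-order segment, it immediately sends a duplicate ACK that tells the peer which sequence number it expects

Fast Retransmit Algorithm: if TCP receives 3 duplicate ACKs in a row, it retransmits the missing segment immediately

I treat the fast retransmit algorithm as one operation within fast recovery mode; on that reading, TCP enters fast recovery mode when it receives 3 duplicate ACKs in a row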

fast recovery mode

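Fast recovery mode works as follows:

1. When the 3rd duplicate ACK arrives, set ssthresh to half of the current effective window and set cwnd to ssthresh+3*MSS (see segment 62 in Figures 21.7 and 21.11 of the book)
2. Retransmit the missing segment
3. For each further duplicate ACK, set cwnd=cwnd+MSS (see segments 64, 65, 66, 68, and 70 in Figures 21.7 and 21.11 of the book)
4. When the ACK for the segment retransmitted in step 2 arrives, set cwnd=ssthresh. Because cwnd=ssthresh at that point and the ACK acknowledges new data (it is not a duplicate ACK), cwnd is then updated to cwnd+MSS; TCP leaves fast recovery mode and enters congestion avoidance mode (see segment 72 in Figures 21.7 and 21.11 of the book)

Notes:
The current effective window in step 1 is min(cwnd, advertised_wnd)
Step 3 keeps increasing cwnd because each duplicate ACK shows the peer is still receiving segments, so the network is not badly congested
Step 4 resets cwnd to ssthresh (the value set in step 1) before leaving fast recovery mode because the congestion is mild and there is no need to drop cwnd any lower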
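
The four steps above can be sketched as a tiny state machine. This is only my reading of the steps, not real kernel code: the Sender class, its field names, and the MSS value are invented for illustration, retransmission is just counted instead of performed, and ordinary window growth outside fast recovery is left out.

```python
from dataclasses import dataclass

MSS = 1460  # bytes; illustrative value


@dataclass
class Sender:
    cwnd: int            # bytes
    ssthresh: int        # bytes
    advertised_wnd: int  # bytes
    dup_acks: int = 0
    in_fast_recovery: bool = False
    retransmits: int = 0

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.dup_acks == 3:
            # Step 1: halve the effective window, inflate cwnd by 3 segments.
            self.ssthresh = min(self.cwnd, self.advertised_wnd) // 2
            self.cwnd = self.ssthresh + 3 * MSS
            # Step 2: retransmit the missing segment (only counted here).
            self.retransmits += 1
            self.in_fast_recovery = True
        elif self.dup_acks > 3:
            # Step 3: each further duplicate ACK adds one segment to cwnd.
            self.cwnd += MSS

    def on_new_ack(self):
        if self.in_fast_recovery:
            # Step 4: deflate cwnd to ssthresh, apply the one-segment
            # increase for an ACK of new data, and leave fast recovery.
            self.cwnd = self.ssthresh + MSS
            self.in_fast_recovery = False
        self.dup_acks = 0


s = Sender(cwnd=8 * MSS, ssthresh=64 * MSS, advertised_wnd=16 * MSS)
for _ in range(5):
    s.on_duplicate_ack()
print(s.cwnd // MSS, s.ssthresh // MSS)  # cwnd and ssthresh in segments
s.on_new_ack()
print(s.cwnd // MSS, s.in_fast_recovery)
```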

congestion avoidance mode

When cwnd>ssthresh, each ACK for new data updates cwnd as follows:

cwnd <- cwnd + segsize*segsize/cwnd + segsize/8

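This is the formula used by 4.3BSD and 4.4BSD. The book's author points out that it does not conform to the standard, but uses it anyway so that the computed values match what the implementation produces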
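
To get a feel for how much the extra segsize/8 term matters, the formula can be evaluated next to the form without that term. A minimal sketch, assuming cwnd and segsize are byte counts and that the divisions are integer divisions; the function names are mine.

```python
def bsd_update(cwnd, segsize):
    """cwnd after one new-data ACK, using the formula quoted above."""
    return cwnd + segsize * segsize // cwnd + segsize // 8


def update_without_extra_term(cwnd, segsize):
    """Same update with the segsize/8 term dropped."""
    return cwnd + segsize * segsize // cwnd


cwnd, segsize = 1024, 256
print(bsd_update(cwnd, segsize))
print(update_without_extra_term(cwnd, segsize))
```

With cwnd=1024 and segsize=256 the quoted formula adds 64+32 bytes per ACK, while the form without the extra term adds only 64, so the BSD window grows noticeably faster than one segment per RTT.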

723937936@qq.com · #48

ICMP Errors

Observing the ICMP host unreachable error

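First, configure the linux host as a router. As noted earlier, when linux forwards an IP datagram as a router and finds no route, by default it does not send back an ICMP host unreachable error message
It sends back ICMP host unreachable only after a reject route has been added to the routing table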

linux $ sudo bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'                 # enable IP forwarding (act as a router)
linux $ sudo route add -net 192.168.2.0 netmask 255.255.255.0 metric 1024 reject       # add a route that explicitly rejects traffic to network 192.168.2.0
linux $ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    100    0        0 enp0s3
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 enp0s3
192.168.0.0     0.0.0.0         255.255.255.0   U     100    0        0 enp0s3
192.168.2.0     -               255.255.255.0   !     1024   -        0 -

Next, point the default route on macos at linux, then open a connection

macos $ sock 192.168.2.3 8888
connect() error: Network is unreachable

The connection attempt above failed; the reported error is Network is unreachable, not Operation timed out

Below is the packet capture taken on the linux host

20:51:25.505567 IP linux > macos: ICMP host 192.168.2.3 unreachable, length 72
20:51:26.508039 IP linux > macos: ICMP host 192.168.2.3 unreachable, length 72
20:51:27.509243 IP linux > macos: ICMP host 192.168.2.3 unreachable, length 72
20:51:28.510379 IP linux > macos: ICMP host 192.168.2.3 unreachable, length 72
20:51:32.513780 IP linux > macos: ICMP host 192.168.2.3 unreachable, length 72

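The tcpdump output above shows that the router sent back 5 ICMP host unreachable error messages
Although the ICMP error is host unreachable, not network unreachable, the errno that macos returns to the application is ENETUNREACH
Applications should treat ENETUNREACH and EHOSTUNREACH as the same error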
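
In application code this simply means checking for both errno values in one place. A minimal sketch using only the Python standard library; the helper names is_unreachable and try_connect are made up for illustration.

```python
import errno
import socket

# Both errno values are handled the same way.
UNREACHABLE = {errno.ENETUNREACH, errno.EHOSTUNREACH}


def is_unreachable(exc):
    """True if exc reports an unreachable network or host."""
    return isinstance(exc, OSError) and exc.errno in UNREACHABLE


def try_connect(host, port, timeout=10.0):
    """Return a connected socket, or None if the destination is unreachable."""
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        if is_unreachable(exc):
            return None
        raise  # any other error (timeout, refused, ...) is left to the caller
```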

Repacketization

Open a connection from macos

macos $ sock linux 8888
hello there
line number 2                # before typing this line, unplug the linux host's network cable or use iptables to drop segments destined for port 8888
and 3

 9.103102 ( 9.102827) IP macos.52015 > linux.8888: flags [PA], seq 3271102335:3271102347, ack 812886231, win 2058, options [ts 1352883248 3324909990], length 12
 9.103129 ( 0.000027) IP linux.8888 > macos.52015: flags [A], seq 812886231:812886231, ack 3271102347, win 509, options [ts 3324919093 1352883248], length 0

// connectivity is cut here

21.337522 (12.234393) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352895468 3324919093], length 14
21.488395 ( 0.150873) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352895618 3324919093], length 14
21.789470 ( 0.301075) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352895918 3324919093], length 14
22.190727 ( 0.401257) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352896319 3324919093], length 14
22.791443 ( 0.600716) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352896919 3324919093], length 14
23.793449 ( 1.002006) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352897919 3324919093], length 14
25.596108 ( 1.802659) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352899719 3324919093], length 14
28.999289 ( 3.403181) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352903119 3324919093], length 14
35.623246 ( 6.623957) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352909719 3324919093], length 14
42.227430 ( 6.604184) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102361, ack 812886231, win 2058, options [ts 1352916319 3324919093], length 14

// connectivity is restored here

48.837282 ( 6.609852) IP macos.52015 > linux.8888: flags [PA], seq 3271102347:3271102367, ack 812886231, win 2058, options [ts 1352922920 3324919093], length 20
48.837307 ( 0.000025) IP linux.8888 > macos.52015: flags [A], seq 812886231:812886231, ack 3271102367, win 509, options [ts 3324958827 1352922920], length 0

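The first two lines correspond to "hello there\n", 12 bytes in total
The 10 lines in the middle correspond to "line number 2\n", 14 bytes in total
The last two lines show that TCP repacketized "line number 2\n" and "and 3\n" into a single 20-byte segment

Also, the macos RTO does not appear to follow a pure exponential backoff: the gaps before the last three retransmissions are all about 6.6s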
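
A quick way to check the backoff pattern is to divide each gap by the one before it. The numbers below are the parenthesized deltas copied from the trace above, starting with the first retransmission at 21.488395; the last gap ends at the repacketized 20-byte segment. The variable names are mine.

```python
# Gaps in seconds between successive transmissions of the unacknowledged data.
gaps = [0.150873, 0.301075, 0.401257, 0.600716, 1.002006,
        1.802659, 3.403181, 6.623957, 6.604184, 6.609852]

# Pure exponential backoff would give a ratio of 2 at every step.
ratios = [round(later / earlier, 2) for earlier, later in zip(gaps, gaps[1:])]
print(ratios)
```

The ratio starts near 2, drops to about 1.33, climbs back toward 2, and then settles at about 1 once the gap levels off near 6.6s, which is consistent with a backoff that is neither a plain doubling nor unbounded.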

723937936@qq.com · #49

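Chapter 22: TCP Persist Timer

If the receiver advertises a window of 0, the sender can no longer send data. To guard against the loss of the receiver's window update segment, the sender queries the receiver for its window size. TCP sends these queries periodically using a timer called the Persist Timer. "Persist" here means persevering: the sender never gives up querying until the receiver advertises a nonzero window

Observing window probes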

 0.000000 ( 0.000000) IP linux.47154 > macos.5555: flags [S], seq 133017518:133017518, win 64240, options [mss 1460,ts 3354223538 0,ws 7], length 0
 0.000489 ( 0.000489) IP macos.5555 > linux.47154: flags [SA], seq 83378274:83378274, ack 133017519, win 33304, options [mss 1460,ws 3,ts 1408617785 3354223538], length 0
 0.000544 ( 0.000055) IP linux.47154 > macos.5555: flags [A], seq 133017519:133017519, ack 83378275, win 502, options [ts 3354223538 1408617785], length 0
 
 0.000661 ( 0.000117) IP linux.47154 > macos.5555: flags [PA], seq 133017519:133018543, ack 83378275, win 502, options [ts 3354223538 1408617785], length 1024
 0.000706 ( 0.000045) IP linux.47154 > macos.5555: flags [A], seq 133018543:133019991, ack 83378275, win 502, options [ts 3354223538 1408617785], length 1448
 0.000765 ( 0.000059) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133018543, win 4035, options [ts 1408617785 3354223538], length 0
 0.000783 ( 0.000018) IP linux.47154 > macos.5555: flags [PA], seq 133019991:133020591, ack 83378275, win 502, options [ts 3354223538 1408617785], length 600
 0.000796 ( 0.000013) IP linux.47154 > macos.5555: flags [A], seq 133020591:133022039, ack 83378275, win 502, options [ts 3354223538 1408617785], length 1448
 0.000806 ( 0.000010) IP linux.47154 > macos.5555: flags [PA], seq 133022039:133024935, ack 83378275, win 502, options [ts 3354223538 1408617785], length 2896
 0.000859 ( 0.000053) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133020591, win 5536, options [ts 1408617785 3354223538], length 0
 0.000867 ( 0.000008) IP linux.47154 > macos.5555: flags [PA], seq 133024935:133032175, ack 83378275, win 502, options [ts 3354223538 1408617785], length 7240
 0.000875 ( 0.000008) IP linux.47154 > macos.5555: flags [PA], seq 133032175:133032879, ack 83378275, win 502, options [ts 3354223539 1408617785], length 704
 0.000905 ( 0.000030) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133023487, win 5174, options [ts 1408617785 3354223538], length 0
 0.000905 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133024935, win 4993, options [ts 1408617785 3354223538], length 0
 0.000957 ( 0.000052) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133027831, win 4631, options [ts 1408617785 3354223538], length 0
 0.000957 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133030727, win 4269, options [ts 1408617785 3354223538], length 0
 0.000957 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133032175, win 4088, options [ts 1408617785 3354223538], length 0
 0.000957 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133032879, win 4000, options [ts 1408617785 3354223539], length 0
 0.000977 ( 0.000020) IP linux.47154 > macos.5555: flags [PA], seq 133032879:133033903, ack 83378275, win 502, options [ts 3354223539 1408617785], length 1024
 0.000987 ( 0.000010) IP linux.47154 > macos.5555: flags [A], seq 133033903:133035351, ack 83378275, win 502, options [ts 3354223539 1408617785], length 1448
 0.001069 ( 0.000082) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133033903, win 3872, options [ts 1408617785 3354223539], length 0
 0.001083 ( 0.000014) IP linux.47154 > macos.5555: flags [PA], seq 133035351:133035951, ack 83378275, win 502, options [ts 3354223539 1408617785], length 600
 0.001098 ( 0.000015) IP linux.47154 > macos.5555: flags [A], seq 133035951:133037399, ack 83378275, win 502, options [ts 3354223539 1408617785], length 1448
 0.001105 ( 0.000007) IP linux.47154 > macos.5555: flags [PA], seq 133037399:133040295, ack 83378275, win 502, options [ts 3354223539 1408617785], length 2896
 0.001114 ( 0.000009) IP linux.47154 > macos.5555: flags [PA], seq 133040295:133047535, ack 83378275, win 502, options [ts 3354223539 1408617785], length 7240
 0.001129 ( 0.000015) IP linux.47154 > macos.5555: flags [PA], seq 133047535:133062015, ack 83378275, win 502, options [ts 3354223539 1408617785], length 14480
 0.002829 ( 0.001700) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133035951, win 3616, options [ts 1408617785 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133038847, win 3254, options [ts 1408617786 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133040295, win 3073, options [ts 1408617786 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133043191, win 2711, options [ts 1408617786 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133046087, win 2349, options [ts 1408617786 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133047535, win 2168, options [ts 1408617786 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133050431, win 1806, options [ts 1408617787 3354223539], length 0
 0.002829 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133053327, win 1444, options [ts 1408617787 3354223539], length 0
 0.002850 ( 0.000021) IP linux.47154 > macos.5555: flags [PA], seq 133062015:133064879, ack 83378275, win 502, options [ts 3354223540 1408617785], length 2864
 0.002906 ( 0.000056) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133056223, win 1082, options [ts 1408617787 3354223539], length 0
 0.002906 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133059119, win 720, options [ts 1408617787 3354223539], length 0
 0.002906 ( 0.000000) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133062015, win 358, options [ts 1408617787 3354223539], length 0
 0.003667 ( 0.000761) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408617787 3354223540], length 0
 // the window probes follow
 0.211649 ( 0.207982) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354223749 1408617787], length 0
 0.212492 ( 0.000843) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408617993 3354223540], length 0
 0.653414 ( 0.440922) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354224191 1408617993], length 0
 0.653753 ( 0.000339) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408618432 3354223540], length 0
 1.484679 ( 0.830926) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354225022 1408618432], length 0
 1.484958 ( 0.000279) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408619260 3354223540], length 0
 3.139554 ( 1.654596) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354226677 1408619260], length 0
 3.140043 ( 0.000489) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408620912 3354223540], length 0
 6.446848 ( 3.306805) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354229984 1408620912], length 0
 6.447465 ( 0.000617) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408624210 3354223540], length 0
13.091919 ( 6.644454) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354236630 1408624210], length 0
13.092627 ( 0.000708) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408630835 3354223540], length 0
26.405423 (13.312796) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354249943 1408630835], length 0
26.405857 ( 0.000434) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408644125 3354223540], length 0
53.539589 (27.133732) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354277077 1408644125], length 0
53.540193 ( 0.000604) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408671224 3354223540], length 0
106.787819 (53.247626) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354330325 1408671224], length 0
106.788233 ( 0.000414) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408724333 3354223540], length 0
213.283748 (106.495515) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354436821 1408724333], length 0
213.283997 ( 0.000249) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408830564 3354223540], length 0
334.116210 (120.832213) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354557654 1408830564], length 0
334.116433 ( 0.000223) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1408951116 3354223540], length 0
454.947503 (120.831070) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354678485 1408951116], length 0
454.947731 ( 0.000228) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1409071707 3354223540], length 0
575.783983 (120.836252) IP linux.47154 > macos.5555: flags [A], seq 133064878:133064878, ack 83378275, win 502, options [ts 3354799322 1409071707], length 0
575.784589 ( 0.000606) IP macos.5555 > linux.47154: flags [A], seq 83378275:83378275, ack 133064879, win 0, options [ts 1409192339 3354223540], length 0

The output above shows that the Persist Timer interval follows exponential backoff: 0.2, 0.4, 0.8, 1.6 ... 120, 120, 120 ...
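
The observed schedule can be modelled as a doubling interval with an upper bound. A minimal sketch: the base of 0.2s and the cap of 120s are read off the trace above and are assumptions about this particular linux host, not documented constants, and persist_intervals is a name I made up.

```python
def persist_intervals(base=0.2, cap=120.0, probes=12):
    """Return the modelled wait before each window probe, in seconds."""
    intervals = []
    interval = base
    for _ in range(probes):
        intervals.append(min(interval, cap))  # never wait longer than the cap
        interval *= 2                         # exponential backoff
    return intervals


print(persist_intervals())
```

The model gives 0.2, 0.4, 0.8 ... 102.4, 120, 120, which lines up with the measured gaps of roughly 0.21, 0.44, 0.83 ... 106.5, 120.8, 120.8 seconds.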

723937936@qq.com · #50

Silly Window Syndrome

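Silly Window Syndrome: the condition in which the sender transmits very small segments each time. It has two causes:
1. The receiver advertises a very small window each time
2. The sending application writes very little data each time

To avoid Silly Window Syndrome, both the sender and the receiver take countermeasures

Measures taken by the receiver:
1. The receiving TCP never advertises a small window (other than 0), where small means less than min(MSS, receive_buffer_size / 2); the MSS here is presumably the one the receiver announced to the sender
2. If the receiver's last advertised window was X and the sender then sends a segment of size Y with X-Y>0, the receiver must advertise X-Y no matter how small it is (see segment 13 in Figure 22.3)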
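
The two receiver rules can look contradictory, so here is how I reconcile them in code: the receiver never takes back window it has already offered (rule 2), and it only offers more once the window can grow by at least min(MSS, receive_buffer_size / 2) (rule 1). This is a sketch of my reading, not a real implementation; window_to_advertise and its parameters are hypothetical names, and all sizes are in bytes.

```python
def window_to_advertise(last_advertised, bytes_received, free_space, mss, rcv_buf):
    """Return the window the receiver puts in its next ACK.

    last_advertised: window offered in the previous ACK
    bytes_received:  data that has arrived since that ACK
    free_space:      free bytes currently available in the receive buffer
    """
    # Rule 2: what is left of the old offer must still be honoured.
    remaining = last_advertised - bytes_received
    # Rule 1: only open the window further in reasonably large steps.
    threshold = min(mss, rcv_buf // 2)
    if free_space - remaining >= threshold:
        return free_space
    return remaining


# Buffer completely full, application has read only 256 bytes: stay at 0.
print(window_to_advertise(4096, 4096, 256, 1024, 4096))
# Application has freed a full MSS: open the window.
print(window_to_advertise(4096, 4096, 1024, 1024, 4096))
```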

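Whether the sender may output a segment is decided by the following conditions:
1. If the send buffer holds at least MSS bytes, it may send
2. If the send buffer holds at least half of the largest window the receiver has ever advertised, it may send
3. If the Nagle algorithm is enabled, it may send as long as there is no unacknowledged segment
4. If the Nagle algorithm is disabled, it may send

Conditions 3 and 4 also seem to depend on the window the receiver advertises: if the peer's advertised window is smaller than min(MSS, receive_buffer_size / 2), the sender does not send immediately but waits for the Persist timer to expire (see segment 14 in Figure 22.3)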
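
The four sender conditions translate almost directly into a predicate. A minimal sketch with invented names (may_send and its parameters); it deliberately ignores the peer's current advertised window, so the extra dependency noted in the last paragraph is not modelled.

```python
def may_send(buffered, mss, max_peer_window, unacked, nagle_enabled):
    """Decide whether the sender may output a segment now.

    buffered:        bytes waiting in the send buffer
    max_peer_window: largest window the peer has ever advertised
    unacked:         bytes sent but not yet acknowledged
    """
    if buffered <= 0:
        return False           # nothing to send
    if buffered >= mss:
        return True            # condition 1: a full-sized segment is ready
    if 2 * buffered >= max_peer_window:
        return True            # condition 2: at least half the peer's largest window
    if nagle_enabled:
        return unacked == 0    # condition 3: Nagle allows one small segment in flight
    return True                # condition 4: Nagle disabled


print(may_send(10, 1460, 65535, unacked=100, nagle_enabled=True))
print(may_send(10, 1460, 65535, unacked=0, nagle_enabled=True))
```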

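The book says at the end that no timer is set in the FIN_WAIT2 state; that is because the Sun system the author used set none. As covered earlier, linux does set a timer

Segments 16, 17, and 20 in Figure 22.3 of the book show that even when the free space in the receive buffer exceeds MSS, the receiver does not send a window update on its own; it advertises the available window only when the sender sends a window probe.
Only when the free space in the receive buffer exceeds receive_buffer_size/2 does the receiver send a window update on its own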

723937936@qq.com · #51

Chapter 23: TCP Keepalive Timer

My conclusion after studying this chapter: do not use the TCP Keepalive Timer; use heartbeats at the application layer instead

723937936@qq.com · #52

Chapter 24: TCP Futures and Performance

Path MTU Discovery

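The Path MTU is the smallest MTU along the path. Path MTU discovery works as follows: the sender sets the DF flag in the IP header of each datagram it sends. If a router along the way finds that the MTU of its outgoing interface is smaller than the datagram, it drops the datagram and sends an ICMP can't fragment error back to the sender; the ICMP error message carries the MTU of the router's outgoing interface. On receiving the ICMP error, the sender uses the MTU carried in it to retransmit an IP datagram of suitable size, repeating until a datagram reaches the destination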
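
The discovery loop is easy to simulate without touching the network. In the sketch below the path is just a list of link MTUs, each "router" reports its outgoing MTU exactly as described above, and discover_path_mtu is a hypothetical name; real ICMP handling, timeouts, and routers that omit the MTU are not modelled.

```python
def discover_path_mtu(link_mtus):
    """Simulate Path MTU discovery over links with the given MTUs.

    Returns (path_mtu, sizes_tried). The first datagram is as large as
    the sender's own link allows.
    """
    size = link_mtus[0]
    sizes_tried = []
    while True:
        sizes_tried.append(size)
        for mtu in link_mtus:
            if size > mtu:
                # DF is set, so the router drops the datagram and reports
                # its outgoing MTU; the sender retries with that size.
                size = mtu
                break
        else:
            # No link rejected the datagram: it reached the destination.
            return size, sizes_tried


print(discover_path_mtu([1500, 1400, 576, 1500]))
```

For a path with MTUs 1500, 1400, 576, 1500 the sender tries 1500, then 1400, then 576, which is the smallest MTU on the path.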

Reply #18 earlier in this thread covered the socket option that controls the MTU discovery mechanism on linux.

Long Fat Pipes

capacity(bits) = bandwidth(bits/sec) * round-trip time(sec)

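A connection's capacity is also called the bandwidth-delay product; reply #43 earlier in this thread gave an example to illustrate the concept

A connection is also called a pipe; pipe here is a general notion and does not refer to the pipe system call

long fat networks: LFNs for short, networks with a large bandwidth-delay product. How large is large? Say larger than 65535 bytes, because the window size field in the tcp header is only 16 bits and tops out at 65535. By that measure even 100 Mbit/s Ethernet counts as an LFN once the RTT exceeds about 5.2 ms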
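
The capacity formula above is easy to play with. A small sketch, with function names of my own choosing, that computes the bandwidth-delay product in bytes and the RTT at which it passes the 65535-byte limit of the 16-bit window field.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: bandwidth (bits/sec) * RTT (sec) / 8."""
    return bandwidth_bps * rtt_s / 8


def rtt_at_window_limit(bandwidth_bps, max_window=65535):
    """RTT in seconds at which the BDP equals max_window bytes."""
    return max_window * 8 / bandwidth_bps


print(bdp_bytes(100_000_000, 0.010))        # 100 Mbit/s link, 10 ms RTT
print(rtt_at_window_limit(100_000_000))     # about 0.0052 s
print(rtt_at_window_limit(1_000_000_000))   # about 0.00052 s
```

On a 100 Mbit/s link with a 10 ms RTT the pipe holds about 125,000 bytes, nearly twice what an unscaled window can cover.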

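long fat pipe: a connection established over an LFN

TCP running over an LFN cannot make full use of the LFN's large capacity, for the following reasons:

1. On an LFN, the 16-bit window size field in the tcp header is too small; the window scale option addresses this
2. On an LFN, losing several packets while one window of data is in flight drains the whole pipeline and sharply reduces throughput; SACK addresses this
3. On an LFN, the window is so large that sampling the RTT only once per window gives a very inaccurate estimate and can cause unnecessary retransmissions; the timestamp option addresses this
4. On a gigabit LFN, the 32-bit sequence number field in the tcp header is too small; the timestamp option addresses this as well (the 4-byte timestamp effectively extends the sequence number)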

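The relationship between delay and bandwidth in gigabit networks

delay: the time it takes one bit to travel from one end to the other; it is bounded by the speed of light, so it is fixed and cannot be optimized
bandwidth: the number of bits that can be injected into the network per unit time; a gigabit network, for example, has a bandwidth of 1,000,000,000 bits/sec

The question is whether more bandwidth always means less time to transfer a given amount of data. The answer is yes, but once bandwidth exceeds a gigabit, adding more has little effect on transfer time. For example, spending twice the money to raise the bandwidth to 2,000,000,000 bits/sec might cut the transfer time by only 10%, which is poor value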

Window Scale Option

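Window Scale Option format: kind=3 (1 byte), len=3 (1 byte), shift count (1 byte)

With the Window Scale Option, the window size is: window_size << shift_count
shift_count ranges from 0 to 14

So the largest window is 65535 << 14 = 1073725440 bytes, slightly under 1 GiB (65536 << 14 would be exactly 1 GiB)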
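
The shift arithmetic can be checked in a few lines. scaled_window applies the formula above; smallest_shift_for is a hypothetical selection rule (the smallest shift whose largest window covers the buffer) that I added only to show how a shift count could be derived from a buffer size. Real stacks may choose differently.

```python
MAX_SHIFT = 14     # upper bound on shift_count
MAX_FIELD = 65535  # largest value of the 16-bit window field


def scaled_window(window_field, shift_count):
    """Actual window in bytes for a given window field and shift count."""
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift_count must be in 0..14")
    return window_field << shift_count


def smallest_shift_for(buffer_size):
    """Hypothetical rule: smallest shift whose maximum window covers buffer_size."""
    shift = 0
    while shift < MAX_SHIFT and (MAX_FIELD << shift) < buffer_size:
        shift += 1
    return shift


print(scaled_window(65535, 14))
print(smallest_shift_for(131070), smallest_shift_for(262140))
```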

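The Window Scale Option can appear only in a SYN segment

The TCP module computes the shift count automatically from the size of the receive buffer. An application sets the receive buffer size with the SO_RCVBUF socket option and so indirectly determines the shift count. SO_RCVBUF must be set before calling connect (on a client) or listen (on a server's listening socket), because the Window Scale Option travels in the SYN segment; a receive buffer size changed after the connection is established can no longer be communicated to the peer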

Observing the Window Scale Option

linux $ sock -v -R128000 macos 8888
SO_RCVBUF = 256000
connected on 192.168.0.6.58246 to 192.168.0.2.8888
TCP_MAXSEG = 1448

linux doubles the buffer size set through the SO_RCVBUF option (see man 7 socket)

 0.000000 ( 0.000000) IP linux.58246 > macos.8888: flags [S], seq 2182126623:2182126623, win 65535, options [mss 1460,ts 1476741476 0,ws 1], length 0
 0.000352 ( 0.000352) IP macos.8888 > linux.58246: flags [SA], seq 1916769072:1916769072, ack 2182126624, win 65535, options [mss 1460,ws 6,ts 1506657680 1476741476], length 0
 0.000374 ( 0.000022) IP linux.58246 > macos.8888: flags [A], seq 2182126624:2182126624, ack 1916769073, win 32768, options [ts 1476741476 1506657680], length 0
 0.000480 ( 0.000106) IP macos.8888 > linux.58246: flags [A], seq 1916769073:1916769073, ack 2182126624, win 2058, options [ts 1506657680 1476741476], length 0

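In its SYN, linux advertises a window field of 65535 and a window scale of 1. The window field of a SYN segment is itself never scaled; the shift count applies to the segments that follow. With a shift count of 1 the largest window that can be advertised is 65535 << 1 = 131070, which is below the receive buffer size set by the application (SO_RCVBUF = 256000); with a shift count of 2 it would be 65535 << 2 = 262140, which exceeds that receive buffer size (SO_RCVBUF = 256000)

In the book's example, the window advertised by 4.3BSD is always larger than the receive buffer size set by the application, which is a little odd; perhaps the 4.3BSD kernel applies some kind of rounding to the receive buffer size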

Timestamp Option

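Timestamp Option format: kind=8 (1 byte), len=10 (1 byte), timestamp value (4 bytes), timestamp echo reply (4 bytes)

Once negotiated, the timestamp option is present in every segment. The sender fills in the timestamp value field, and the receiver copies that value into the timestamp echo reply field when it sends the acknowledgment. The sender can therefore compute an RTT sample from every acknowledgment it receives, which raises both the sampling frequency and the accuracy of the RTT estimate.

Timestamp values increase monotonically, incrementing by 1 at a fixed interval. RFC 1323 recommends an interval between 1 ms and 1 second; the 4.3BSD implementation increments the value every 500 ms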

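TCP does not send an ack segment for every data segment it receives; one ack segment may acknowledge several data segments. Which data segment does the timestamp echo reply field of that ack segment correspond to?

TCP keeps a tsrecent variable and a lastack variable for each connection:
tsrecent is the value TCP puts in the timestamp echo reply field when it sends an ack segment
lastack is the acknowledgment number of the last ack segment TCP sent, that is, the sequence number of the first byte it was expecting at that moment
When an arriving data segment contains the byte numbered lastack, tsrecent is updated to that segment's timestamp value; otherwise tsrecent is left unchanged

So the answer to the question above is: the timestamp echo reply field of an ack segment carries the timestamp value of the first data segment being acknowledged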
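
The bookkeeping with tsrecent and lastack can be traced with a small model. This is a sketch of the rule as I understand it, with an invented class name; it handles in-order data only (no reassembly queue) and ignores sequence number wraparound.

```python
class TimestampEcho:
    """Receiver-side model of which timestamp gets echoed in an ACK."""

    def __init__(self, initial_seq):
        self.tsrecent = 0            # value to echo in the next ACK
        self.lastack = initial_seq   # ack number of the last ACK sent
        self.rcv_nxt = initial_seq   # next in-order byte expected

    def on_data(self, seq, length, tsval):
        # Remember the timestamp only if the segment covers byte lastack.
        if seq <= self.lastack < seq + length:
            self.tsrecent = tsval
        if seq == self.rcv_nxt:
            self.rcv_nxt += length   # in-order data advances the window

    def send_ack(self):
        """Return (ack number, timestamp echo reply) and record lastack."""
        self.lastack = self.rcv_nxt
        return self.rcv_nxt, self.tsrecent


r = TimestampEcho(initial_seq=1000)
r.on_data(seq=1000, length=100, tsval=5)
r.on_data(seq=1100, length=100, tsval=6)
print(r.send_ack())  # one ACK for both segments echoes the first timestamp
```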

PAWS：Protection Against Wrapped Sequence Numbers

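The Sequence number field in the tcp header is 32 bits, a 4 GiB space, so it wraps after every 4 GiB of data sent

As covered earlier, the maximum lifetime of an IP datagram is the MSL (Maximum Segment Lifetime). Suppose a lost segment reappears within that time and its Sequence number falls inside the window currently being transferred. With the timestamp option, the receiver discards the segment that carries the smaller timestamp value; without it, the receiver would take the segment for an out-of-order one and store it in the receive buffer

The MSL is typically 120s, while on a gigabit network sending 4 GiB of data takes about 34s, so this situation can indeed occur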
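
Both the 34s figure and the PAWS test itself fit in a few lines. A minimal sketch with names of my own: wrap_time_seconds reproduces the arithmetic, and paws_accept compares timestamps as 32-bit serial numbers, which is an assumption about how the comparison is done so that timestamp wraparound is also handled.

```python
SEQ_SPACE = 2 ** 32  # bytes covered by the 32-bit sequence number


def wrap_time_seconds(bandwidth_bps):
    """Time to send one full sequence space at the given bit rate."""
    return SEQ_SPACE * 8 / bandwidth_bps


def paws_accept(seg_tsval, ts_recent):
    """True if the segment's timestamp is not older than ts_recent.

    The difference is taken modulo 2**32, so a timestamp that has just
    wrapped around still counts as newer.
    """
    return ((seg_tsval - ts_recent) & 0xFFFFFFFF) < 0x80000000


print(round(wrap_time_seconds(1_000_000_000), 2))  # gigabit link
print(paws_accept(seg_tsval=90, ts_recent=100))    # old duplicate: rejected
```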

TCP Performance

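Computing the maximum throughput of TCP data transfer (using 100 Mbit/s Ethernet as the example and ignoring ACKs):

A full-sized data segment carries at most 1460 bytes of data; adding the TCP header, IP header, Ethernet frame header, and other overhead gives 1538 bytes in total (see Figure 24.9)

The actual data throughput is therefore: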

throughput = (1460 / 1538) * (100,000,000 / 8) = 11,866,059 bytes/sec
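
The same calculation in code, so other link speeds can be tried. The per-field breakdown of the 78 bytes of overhead is my assumption about what the figure contains (the text above only gives the 1538 total), and the names are mine.

```python
DATA = 1460  # bytes of TCP payload in a full-sized segment

# Assumed breakdown of the per-frame overhead, in bytes.
OVERHEAD = {
    "preamble": 8,
    "Ethernet header": 14,
    "IP header": 20,
    "TCP header": 20,
    "CRC": 4,
    "interpacket gap": 12,
}
FRAME = DATA + sum(OVERHEAD.values())  # 1538 bytes on the wire per segment


def max_throughput(link_bps):
    """Upper bound on TCP payload throughput in bytes/sec, ignoring ACKs."""
    return DATA * link_bps / (8 * FRAME)


print(FRAME)
print(int(max_throughput(100_000_000)))
```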

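The figure above is the theoretical value, a little over 11 MB per second; it cannot be reached in practice. Some real-world limits:

1. The path may contain a slower link
2. The machine's memory bandwidth is another limit (for example, the rate of copying data from user space to kernel space), though on modern machines memory bandwidth is unlikely to be the bottleneck
3. The window size advertised by the peer, and the RTT

For point 2, the following command gives a rough test (copying 10 GB of data from kernel space to user space):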

linux $ time dd if=/dev/zero of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 0.52927 s, 19.8 GB/s

real	0m0.530s
user	0m0.013s
sys	0m0.517s

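The test above shows a memory-copy bandwidth of 19.8GB/s

For point 3, since:

bandwidth-delay product = bandwidth * RTT

and the bandwidth-delay product corresponds to the receiver's receive buffer, that is, the window size the receiver advertises,
we get bandwidth = advertised_win_size / RTT

Suppose:
advertised_win_size takes its maximum value, 65535 << 14 = 1,073,725,440
RTT is 20ms

Then:
bandwidth = 1,073,725,440 / 0.02s = 53,686,272,000 bytes/sec, roughly 50GB/sec

So the window size of the TCP protocol is unlikely to limit TCP throughput either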
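
For comparison, the same formula can be run with and without window scaling. A small sketch; window_limited_throughput is a name I made up, and the 20ms RTT is the value assumed above.

```python
def window_limited_throughput(window_bytes, rtt_s):
    """Maximum bytes/sec when at most one window can be in flight per RTT."""
    return window_bytes / rtt_s


RTT = 0.02  # seconds

scaled = window_limited_throughput(65535 << 14, RTT)
unscaled = window_limited_throughput(65535, RTT)
print(f"with window scale 14: {scaled:,.0f} bytes/sec")
print(f"without window scale: {unscaled:,.0f} bytes/sec")
```

Without the window scale option the same 20ms path would be capped at about 3.3 MB/sec (roughly 26 Mbit/s), which is why the option matters on long fat pipes.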

Measuring bandwidth

The iperf3 tool can be used to measure network bandwidth

Installing iperf3 on ubuntu

linux $ sudo apt install iperf3

Run iperf3 -s on ubuntu18.04

linux $ iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 192.168.0.5, port 49506
[  5] local 192.168.0.6 port 5201 connected to 192.168.0.5 port 49520
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-1.00   sec   700 KBytes  5.73 Mbits/sec
[  5]   1.00-2.00   sec   897 KBytes  7.34 Mbits/sec
[  5]   2.00-3.00   sec   803 KBytes  6.57 Mbits/sec
[  5]   3.00-4.00   sec   723 KBytes  5.92 Mbits/sec
[  5]   4.00-5.00   sec   817 KBytes  6.70 Mbits/sec
[  5]   5.00-6.00   sec   963 KBytes  7.88 Mbits/sec
[  5]   6.00-7.00   sec   981 KBytes  8.04 Mbits/sec
[  5]   7.00-8.00   sec   833 KBytes  6.82 Mbits/sec
[  5]   8.00-9.00   sec   734 KBytes  6.01 Mbits/sec
[  5]   9.00-10.00  sec   826 KBytes  6.77 Mbits/sec
[  5]  10.00-10.09  sec   107 KBytes  9.95 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.09  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-10.09  sec  8.19 MBytes  6.81 Mbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

Run iperf3 -c 192.168.0.6 on windows wsl2

$ iperf3 -c 192.168.0.6
Connecting to host 192.168.0.6, port 5201
[  5] local 172.29.65.74 port 40662 connected to 192.168.0.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.05 MBytes  8.80 Mbits/sec    0   67.9 KBytes
[  5]   1.00-2.00   sec   851 KBytes  6.97 Mbits/sec    2   83.4 KBytes
[  5]   2.00-3.00   sec   912 KBytes  7.47 Mbits/sec    0   91.9 KBytes
[  5]   3.00-4.00   sec   912 KBytes  7.47 Mbits/sec    2   96.2 KBytes
[  5]   4.00-5.00   sec   730 KBytes  5.98 Mbits/sec    1    102 KBytes
[  5]   5.00-6.00   sec   912 KBytes  7.47 Mbits/sec    1    107 KBytes
[  5]   6.00-7.00   sec  1.13 MBytes  9.47 Mbits/sec    0    113 KBytes
[  5]   7.00-8.00   sec   547 KBytes  4.48 Mbits/sec    2    117 KBytes
[  5]   8.00-9.00   sec   851 KBytes  6.97 Mbits/sec    0    123 KBytes
[  5]   9.00-10.00  sec   973 KBytes  7.97 Mbits/sec    1    126 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  8.71 MBytes  7.31 Mbits/sec    9             sender
[  5]   0.00-10.00  sec  8.19 MBytes  6.87 Mbits/sec                  receiver

iperf Done.
