Linux and the strange case of the TCP black holes
On the Internet, some sites can be mysteriously unreachable even though the server hosting them is alive and well. What is the cause of these mysterious black holes?
At the beginning of a TCP connection, the two systems negotiate the maximum segment size (MSS), which is the maximum size of the data segments they will exchange. This negotiation does not take into account the path between the two systems, which might require a smaller packet size.
The corresponding Maximum Transmission Unit (MTU), which is the maximum size of outgoing packets, is obtained from this MSS by adding the size of the IP and TCP headers (40 bytes in total for IPv4 and TCP headers without options).
MTU = MSS + 40
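The arithmetic can be checked with a minimal Python sketch. The function and constant names are illustrative, and the 20-byte header sizes assume IPv4 and TCP headers without options.

```python
IPV4_HEADER = 20  # bytes, assuming an IPv4 header without options
TCP_HEADER = 20   # bytes, assuming a TCP header without options


def mtu_from_mss(mss: int) -> int:
    """Return the MTU matching a given MSS (MTU = MSS + 40)."""
    return mss + IPV4_HEADER + TCP_HEADER


def mss_from_mtu(mtu: int) -> int:
    """Return the MSS that fits in a given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(mtu_from_mss(1460))  # 1500, the usual Ethernet MTU
print(mss_from_mtu(1500))  # 1460
```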
If a packet is too big for one of the intermediate routers, the router will send an ICMP message to the sender telling it to adjust the packet size [1].
Unfortunately, these ICMP packets are sometimes dropped by paranoid firewalls. This is where things start going wrong. The router drops a packet and sends an ICMP message. The message is dropped by the firewall. The original sender sends the packet again, and the router drops it again. Packets disappear without a trace and nobody is notified of what is happening. This is an endless black hole.
A solution to this problem is described by RFC 4821. The sender starts with a small packet size and sends probes (larger packets) to discover the optimal packet size. If a probe is not acknowledged but the rest of the packets are fine, the sender deduces that the probe was too large for the path and sends smaller probes. If the probe is acknowledged, the larger packet size is used for the following packets.
This mechanism has been available in Linux since version 2.6.17. It can be activated by setting net.ipv4.tcp_mtu_probing to 1 (use when a black hole is detected) or 2 (always use). This first implementation was unfortunately not very subtle.
When probing starts, the kernel forces the maximum segment size (MSS) to a default minimal size (tcp_base_mss [2]). If the packet is not successful, the packet size is divided by 2. If the probe succeeds, the next probe will have twice its segment size.
With net.ipv4.tcp_mtu_probing set to 1, MTU probing only starts after a black hole has been detected. For example, suppose the sender detects a black hole on a connection with a negotiated MSS of 1460. The kernel reduces the segment size to 512 (works), then probes with 1024 (works). The next probe would be 2048, but that is over the negotiated MSS. Probing stops and the sender uses an MSS of 1024, even if the optimal MSS for the path was 1280.
This is not great, but it is an acceptable compromise: it avoids the black hole at the cost of a suboptimal MSS.
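The doubling scheme can be modelled with a short Python sketch. This is a simplified model of the behaviour described in the text, not the kernel code; the function name and the path_mss parameter (the largest segment the path actually carries) are hypothetical.

```python
def probe_by_doubling(negotiated_mss: int, path_mss: int, base_mss: int = 512) -> int:
    """Model of the pre-4.1 scheme: start at base_mss, then double on success.

    path_mss is a hypothetical parameter: the largest segment the path carries.
    """
    current = base_mss
    # If even the base size does not get through, halve it until it does.
    while current > path_mss:
        current //= 2
    while True:
        candidate = current * 2
        if candidate > negotiated_mss:
            break  # never probe above the negotiated MSS
        if candidate > path_mss:
            break  # the probe is lost, keep the last working size
        current = candidate
    return current


# The example from the text: negotiated MSS 1460, optimal MSS 1280.
print(probe_by_doubling(1460, 1280))  # 1024
```

In this model, any path whose optimal MSS lies between 1024 and 1460 ends up at 1024, which is the loss of efficiency discussed above.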
This mechanism can also be set to be always used (tcp_mtu_probing=2). With the doubling scheme described above, this permanently reduces the MSS to 1024… That’s not a great idea.
With Linux 4.1, this mechanism has been seriously improved. Instead of doubling the size, the improved version uses a binary search to find the optimal size. With a binary search, the probes quickly converge to the best MSS for the connection. Moreover, the base MSS (tcp_base_mss) is now 1024.
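A binary search over the segment size can be sketched as follows. This is an idealized model rather than the kernel implementation: it assumes that the base MSS always gets through, that a probe succeeds exactly when it fits the path, and that the search runs until the bounds meet. The function name and the path_mss parameter are hypothetical.

```python
def probe_by_binary_search(negotiated_mss: int, path_mss: int, base_mss: int = 1024) -> int:
    """Idealized binary search between base_mss and the negotiated MSS.

    Assumes base_mss itself fits the path (path_mss >= base_mss).
    """
    low = base_mss         # largest size known to work
    high = negotiated_mss  # largest size still worth trying
    while low < high:
        probe = (low + high + 1) // 2  # upper midpoint, so the loop always progresses
        if probe <= path_mss:
            low = probe       # probe acknowledged: raise the lower bound
        else:
            high = probe - 1  # probe lost: lower the upper bound
    return low


# Same example as before: negotiated MSS 1460, optimal MSS 1280.
print(probe_by_binary_search(1460, 1280))  # 1280
```

On the example that the doubling scheme left stuck at 1024, this model reaches the optimal 1280 after nine probes.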
What are the best settings?
Always enable MTU probing in case of a black hole (tcp_mtu_probing=1). This prevents black holes and has no impact on the other TCP connections.
For kernels before 4.1, use 1024 as the base MSS (tcp_base_mss=1024).
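To check a machine against these recommendations, the current values can be read from procfs, where the sysctl net.ipv4.tcp_mtu_probing appears as /proc/sys/net/ipv4/tcp_mtu_probing. The following Python sketch does this; the helper names are hypothetical, and the "base MSS below 1024" test is only an approximation of "kernel before 4.1".

```python
from pathlib import Path


def read_sysctl(name: str, root: str = "/proc/sys") -> int:
    """Read an integer sysctl such as 'net.ipv4.tcp_mtu_probing' from procfs."""
    path = Path(root).joinpath(*name.split("."))
    return int(path.read_text().strip())


def suggested_settings(root: str = "/proc/sys") -> list:
    """Return sysctl lines that would bring the host in line with the advice above."""
    advice = []
    if read_sysctl("net.ipv4.tcp_mtu_probing", root) == 0:
        advice.append("net.ipv4.tcp_mtu_probing = 1")
    # Assumption: a base MSS below 1024 indicates the old default of 512.
    if read_sysctl("net.ipv4.tcp_base_mss", root) < 1024:
        advice.append("net.ipv4.tcp_base_mss = 1024")
    return advice


# On a Linux host: print("\n".join(suggested_settings()))
```

The returned lines use the sysctl.conf syntax, so they can be reviewed and then added to the system configuration by hand.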
[1] This mechanism is called Path MTU Discovery (PMTUD) and is defined by RFC 1191. It is activated by default in the Linux kernel.
[2] The default value of tcp_base_mss is 512 on kernels before 4.1, and 1024 as of kernel 4.1. To find the corresponding MTU, you need to add the size of the IP and TCP headers. The MTU value corresponding to an MSS of 1024 is 1064 (1024 (data) + 20 (IP header) + 20 (TCP header)). On older kernels, you will also need to add the size of the TCP options.