Bug 41 - TCP transfers can stall with smaller socket buffer sizes
Summary: TCP transfers can stall with smaller socket buffer sizes
Status: RESOLVED FIXED
Alias: None
Product: passt
Classification: Unclassified
Component: TCP
Version: unspecified
Hardware: All Linux
Importance: Normal normal
Assignee: David Gibson
URL:
Depends on:
Blocks:
 
Reported: 2023-01-24 03:55 UTC by David Gibson
Modified: 2023-03-08 21:51 UTC
CC List: 1 user

See Also:



Description David Gibson 2023-01-24 03:55:41 UTC
If the passt testsuite is run with the default socket buffer limits (net.core.[rw]mem_max = 212992), then in the passt_in_ns section the IPv6 host-to-guest large transfer stalls semi-frequently (on about 30% of attempts for me).  The problem goes away if the socket buffer limits are raised to 16 MiB, which is usually done for testing since that's needed for best performance.

We might not get optimal performance with the default buffer sizes, but we shouldn't outright stall.  The exact trigger conditions aren't clear yet, but it seems quite plausible that this could cause random stalls on real-world TCP transfers.
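
For context (not from the original report): the kernel silently clamps SO_SNDBUF/SO_RCVBUF requests to net.core.wmem_max/rmem_max, which is why raising those sysctls changes behaviour.  A minimal standalone sketch, not passt code, that shows the clamping by reading back what the kernel actually granted:

    /* Sketch, not passt code: request a 16 MiB send buffer and read back
     * what the kernel actually granted.  With the default wmem_max of
     * 212992, the request is silently clamped (Linux also doubles the
     * stored value to account for bookkeeping overhead). */
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int s = socket(AF_INET6, SOCK_STREAM, 0);
        int req = 16 * 1024 * 1024, got;
        socklen_t len = sizeof(got);

        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &req, sizeof(req));
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &got, &len);
        printf("requested %d bytes, kernel granted %d\n", req, got);
        return 0;
    }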
Comment 1 David Gibson 2023-01-24 04:24:20 UTC
Adding --debug shows many partial writes to the qemu socket, meaning we're effectively losing packets.  I think what's happening is that we're filling up the socket buffer, so we're no longer reliably able to send all our pending packets to the qemu socket without blocking, and so we drop them.  Then, because we keep resending the same data, the socket buffer never gets a chance to drain substantially, so we just keep losing packets.

This suggests there are two problems here:

1) We're not detecting send errors; we just drop the packets, which is a pretty inefficient way to signal the failure (see the sketch after this list for what detecting a partial send could look like).

2) We're not letting the buffer empty.  At least at first glance, dropping frames because the socket buffer is full should work similarly to dropping frames due to congestion, which TCP is designed for.  If we're slowing down transmission in response, as TCP should, why isn't that allowing the buffer to drain?
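
To illustrate point (1), here is a rough sketch only, with made-up names rather than passt's actual tap-side code: on a short or blocked write, keep the unsent remainder and retry once the socket is writable again, instead of discarding the frame.

    /* Hypothetical sketch, not passt code: write one frame to the tap/qemu
     * socket, resuming a previous partial write instead of dropping the
     * frame.  'tap_fd', 'frame', 'len' and '*sent' are stand-in names.
     * Returns 1 when the frame is fully flushed, 0 if it should be retried
     * on the next EPOLLOUT, -1 on a real error. */
    #include <errno.h>
    #include <stddef.h>
    #include <sys/socket.h>

    static int tap_send_frame(int tap_fd, const char *frame, size_t len,
                              size_t *sent)
    {
        while (*sent < len) {
            ssize_t n = send(tap_fd, frame + *sent, len - *sent,
                             MSG_DONTWAIT | MSG_NOSIGNAL);

            if (n < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    return 0;   /* buffer full: keep *sent, retry later */
                return -1;      /* real error: reset the connection */
            }
            *sent += (size_t)n;
        }
        *sent = 0;              /* frame fully written, ready for the next */
        return 1;
    }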
Comment 2 David Gibson 2023-01-25 05:59:16 UTC
Ok, I think I've located the problem, at least for #2 above.

There are two places we process acks from tap:

A) in tcp_tap_handler()

    if (th->ack) {
        conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
        conn->retrans = 0;
    }

B) in tcp_data_from_tap()

    if (ack) {
        if (max_ack_seq == conn->seq_to_tap) {
            conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
            conn->retrans = 0;
        }

        tcp_sock_consume(conn, max_ack_seq);
    }

(A) looks bogus to me: we're clearing ACK_FROM_TAP_DUE on *any* ack from the tap, even if it doesn't acknowledge *all* the unacked data we've already sent to the guest.

If we're still receiving data from the socket side, this doesn't make much difference, because that will keep setting ACK_FROM_TAP_DUE.  However, once all the data from the socket is buffered, if we get a partial send, the guest acks only the data it actually received, we clear ACK_FROM_TAP_DUE at (A), and we never retransmit the part it didn't get.

If I comment out the conn_flag() call at (A), I can no longer reproduce the problem.  I'm not sure that's a completely correct fix, though.  I'm pretty sure we *do* need to reset retrans to 0 here (we get a reset due to too many retransmits if that's removed).  We possibly only want to do that when we're making forward progress, though (i.e. if this ack has a greater sequence number than conn->seq_ack_from_tap).
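
For clarity, a sketch of what a guarded version of (A) might look like; this is not necessarily the shape of the eventual fix, and SEQ_GE()/SEQ_GT() are assumed names for wrap-safe sequence comparisons:

    /* Sketch only: clear ACK_FROM_TAP_DUE at (A) only when everything we
     * have sent towards the guest is acknowledged, and reset the
     * retransmission counter only when the ack makes forward progress.
     * SEQ_GE()/SEQ_GT() are assumed wrap-safe sequence comparisons. */
    if (th->ack) {
        uint32_t ack_seq = ntohl(th->ack_seq);

        if (SEQ_GE(ack_seq, conn->seq_to_tap))
            conn_flag(c, conn, ~ACK_FROM_TAP_DUE);  /* all sent data acked */

        if (SEQ_GT(ack_seq, conn->seq_ack_from_tap))
            conn->retrans = 0;                      /* forward progress */
    }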
Comment 3 Stefano Brivio 2023-03-08 21:51:06 UTC
Fixed by commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer").
