Bug 74 - qemu TCP iperf3 test slows down to 0 bytes/second
Summary: qemu TCP iperf3 test slows down to 0 bytes/second
Status: RESOLVED FIXED
Alias: None
Product: passt
Classification: Unclassified
Component: TCP
Version: unspecified
Hardware: All Linux
Importance: Normal normal
Assignee: David Gibson
URL:
Depends on:
Blocks:
 
Reported: 2023-08-28 16:21 UTC by Matej Hrica
Modified: 2023-11-10 00:43 UTC
CC List: 2 users

See Also:


Attachments
log produced by passt (8.09 MB, text/plain)
2023-08-28 16:21 UTC, Matej Hrica
passt1.pcap (361.52 KB, application/vnd.tcpdump.pcap)
2023-08-29 12:45 UTC, Matej Hrica

Description Matej Hrica 2023-08-28 16:21:16 UTC
Created attachment 22
log produced by passt

I am using iperf3 inside a qemu guest connected through passt, and TCP networking can sometimes slow down to 0 bytes/sec.
The slowdown is specific to the connection and seems to be related to TCP window scaling. Once a connection slows down to 0 bytes/sec it never recovers, but a newly created connection may work fine.

# Info: 
I am using the latest passt from the master branch (commit a7e4bfb857cb). The host and guest are both Fedora Workstation 38, and qemu is from the Fedora repository. Note that I ran into this issue while adding support for virtio-net via passt to libkrun (https://www.github.com/containers/libkrun), so the issue should not be specific to qemu; just in case, the qemu version is qemu-7.2.4-2.fc38.

# Steps to reproduce:
1. run passt as: `./passt -f --stderr --trace -4 -t 5201 -u 5201`
2. run qemu as: `qemu-kvm -cdrom ~/Downloads/Fedora-Workstation-Live-x86_64-38-1.6.iso -smp 4 -m 4096 -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket`
   (get the iso here: https://download.fedoraproject.org/pub/fedora/linux/releases/38/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-38-1.6.iso)
3. install iperf3 inside the guest: `sudo dnf install iperf3`
   Note that the download sometimes slows down to 0 B/s because of this bug - you can try multiple times until it succeeds, or use the workaround.
4. Run `iperf3 -s` on the guest
5. Run `iperf3 -c 127.0.0.1` on the host
   If the reported speed stays around 150-170 Mbit/s for at least 2 seconds, the connection will probably keep working for a long time before slowing down. When the connection is working fine, you can cancel the test with ^C and retry multiple times until you get a connection that slows down in the first few seconds. Once a connection slows down, it does not recover.


# Workaround:
Disabling TCP window scaling inside the guest solves the issue. It also sometimes massively improves TCP performance, but only for the next iperf test; subsequent iperf tests are slow - around 10.5 Mbits/sec, sometimes dropping to 0 bytes/sec, but recovering rather than staying at 0.
To do that, you can use: `sudo sysctl -w net.ipv4.tcp_window_scaling=0`


# Description of the attached passt log:
All tests are with net.ipv4.tcp_window_scaling enabled.
Timestamps in the passt log before 161.8822 correspond to installing iperf3 using dnf.


Timestamps before 279.7843 correspond to this iperf test:

$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 60000 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  16.2 MBytes   136 Mbits/sec    4   1.75 MBytes       
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
[  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
^C[  5]  12.00-12.81  sec  0.00 Bytes  0.00 bits/sec    0   1.75 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-12.81  sec  16.2 MBytes  10.6 Mbits/sec    4             sender
[  5]   0.00-12.81  sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated




Timestamps before 332.2857 correspond to this iperf test:

$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 48098 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  20.9 Mbits/sec    0    320 KBytes       
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
^C[  5]   7.00-7.49   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-7.49   sec  2.50 MBytes  2.80 Mbits/sec    0             sender
[  5]   0.00-7.49   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated



Timestamps before 420.9430 correspond to this iperf test:

$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 56828 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  27.5 MBytes   231 Mbits/sec    0   2.06 MBytes       
[  5]   1.00-2.00   sec  8.75 MBytes  73.4 Mbits/sec    3   2.06 MBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   2.06 MBytes       
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   2.06 MBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   2.06 MBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   2.06 MBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   2.06 MBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   2.06 MBytes       
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-8.06   sec  36.2 MBytes  37.7 Mbits/sec    3             sender
[  5]   0.00-8.06   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated
Comment 1 Stefano Brivio 2023-08-28 17:26:25 UTC
Hi Matej,

(In reply to Matej Hrica from comment #0)
> # Info: 
> I am using the latest passt from master branch (commit a7e4bfb857cb). The
> host and guest are both Fedora Workstation 38. I am using qemu from Fedora
> repository. Note, that I ran into this issue while adding support for
> virtio-net using passt into libkrun
> (https://www.github.com/containers/libkrun), so the issue should not be
> specific to qemu, but just in case, the qemu version is qemu-7.2.4-2.fc38. 

No known issues with that version -- there was an issue which would be sort of compatible with what you're describing, but it was fixed here:

$ git describe 7550a82259fcf9ce5f1f6443ced779d0eb8afdca
v7.1.0-1255-g7550a82259

and also the guest kernel should be recent enough, last fix I'm aware of for an issue of that sort:

$ git describe d71ebe8114b4bf622804b810f5e274069060a174
v6.2-rc3-223-gd71ebe8114b4

> # Steps to reproduce:
> 1. run passt as: `./passt -f --stderr --trace -4 -t 5201 -u 5201`

I guess you tried already, but... does this also happen without --trace? That's very verbose, so very low throughput with occasional interruptions is actually expected.

Could you try to take a capture (instead) with --pcap? That should show a bit more descriptively what happens just before the connection stalls.

> # Workaround:
> Disabling TCP window scaling inside the guest solves the issue.

Well, then we'll get a 64k window and a "safely" low throughput. It looks like we are again hitting some issue with bursts...

I haven't tried to reproduce this yet; I haven't observed anything similar on Fedora 38, but I was using different parameters for QEMU. Thanks for providing the steps! I plan to have a look.
Comment 2 Matej Hrica 2023-08-29 12:45:38 UTC
Created attachment 23
passt1.pcap

on host:
$ ./passt -f -4 -t 5201 -u 5201

in guest:
$ iperf3 -s

on host:
$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 53402 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  20.9 Mbits/sec    0    320 KBytes       
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
^C[  5]   7.00-7.39   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-7.39   sec  2.50 MBytes  2.84 Mbits/sec    0             sender
[  5]   0.00-7.39   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated

on host:
$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 43318 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  21.0 Mbits/sec    0    320 KBytes       
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
^C[  5]   8.00-8.78   sec  0.00 Bytes  0.00 bits/sec    0    320 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-8.78   sec  2.50 MBytes  2.39 Mbits/sec    0             sender
[  5]   0.00-8.78   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated
Comment 3 Matej Hrica 2023-08-29 12:54:40 UTC
Yes, it does happen without `--trace`, possibly less frequently (not sure); it still happens on most iperf tests.
Giving fewer cores to the guest may slightly lower the chance of this happening, but it still happens even with `-smp 1`.


I am adding pcaps.
I ran it on a freshly installed Fedora 38 workstation guest, with just iperf3 installed, 
with "metered connection" enabled in the GUI network settings to minimize unrelated traffic.

guest kernel version:
$ uname -a
Linux hostname 6.2.9-300.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 30 22:32:58 UTC 2023 x86_64 GNU/Linux
(this is Fedora 38 workstation freshly installed from the iso without updates, but I tried updating it and I still have the same issue)


host kernel version:
$ uname -a
Linux m-rh-lap 6.4.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 23 17:46:49 UTC 2023 x86_64 GNU/Linux
Comment 4 Matej Hrica 2023-08-29 12:59:25 UTC
Another pcap, but this one is too big to upload here.
https://drive.google.com/file/d/1CuZrsTEn2uJSXWCWpZYsMeik_XEc3wV5/view?usp=drive_link

Once again, iperf3 -s is running in the guest; on the host I am running the following tests:


$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 33028 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  21.2 MBytes   178 Mbits/sec    0   1.81 MBytes       
[  5]   1.00-2.00   sec  20.0 MBytes   168 Mbits/sec    0   1.81 MBytes       
[  5]   2.00-3.00   sec  22.5 MBytes   189 Mbits/sec    0   1.81 MBytes       
[  5]   3.00-4.00   sec  15.0 MBytes   126 Mbits/sec    0   1.81 MBytes       
[  5]   4.00-5.00   sec  21.2 MBytes   178 Mbits/sec    0   1.81 MBytes       
[  5]   5.00-6.00   sec  21.2 MBytes   178 Mbits/sec    0   1.81 MBytes       
[  5]   6.00-7.00   sec  18.8 MBytes   157 Mbits/sec    0   1.81 MBytes       
[  5]   7.00-8.00   sec  17.5 MBytes   147 Mbits/sec    0   1.81 MBytes       
[  5]   8.00-9.00   sec  20.0 MBytes   168 Mbits/sec    0   1.81 MBytes       
^C[  5]   9.00-9.24   sec  5.00 MBytes   178 Mbits/sec    0   1.81 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-9.24   sec   182 MBytes   166 Mbits/sec    0             sender
[  5]   0.00-9.24   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated


$ iperf3 -t 1000 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 37022 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  21.2 MBytes   178 Mbits/sec    0   1.75 MBytes       
[  5]   1.00-2.00   sec  11.2 MBytes  94.3 Mbits/sec    3   1.81 MBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  21.00-22.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  22.00-23.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  23.00-24.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  24.00-25.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  25.00-26.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  26.00-27.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  27.00-28.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  28.00-29.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  29.00-30.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
[  5]  30.00-31.00  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
^C[  5]  31.00-31.78  sec  0.00 Bytes  0.00 bits/sec    0   1.81 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-31.78  sec  32.5 MBytes  8.58 Mbits/sec    3             sender
[  5]   0.00-31.78  sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated
Comment 5 David Gibson 2023-08-30 03:52:08 UTC
There's a lot going on in those packet captures, and I certainly haven't deciphered what's going on yet, but one thing that stands out to me is that Wireshark is showing a bunch of "TCP Previous segment not captured" errors.  Basically it seems like passt is jumping forward in the stream, either simply failing to send some frames or incorrectly advancing the sequence number.  Or it could be that it's sending them but omitting them from the pcap file, though I think that's less likely.

Any of those options definitely seems like a bug, though I don't yet have much idea why it would do that, or what would trigger it.
Comment 6 David Gibson 2023-08-30 05:04:01 UTC
I've now reproduced this on my system (thanks for the detailed instructions).  I've hit it both with the Fedora live image you suggest, and also with an mbuto image based on my host configuration (also Fedora 38).

I haven't yet figured out why this setup shows it while the very similar iperf3 tests in our standard testsuite don't.
Comment 7 David Gibson 2023-08-30 05:24:38 UTC
Ok, found at least one reason why the testsuite isn't hitting this problem: in order to test for maximum possible throughput, the testsuite is designed to work with very large socket buffers, with net.core.rmem_max and net.core.wmem_max set to 16 MiB.  When I do that, I can no longer reproduce this bug.
Comment 8 David Gibson 2023-08-30 05:34:30 UTC
More precisely, I've been able to reproduce this bug very easily with wmem and rmem of 1MiB and below, and haven't managed to reproduce it at all with rmem/wmem 2MiB and up.
Comment 9 Stefano Brivio 2023-08-30 06:48:53 UTC
Matej, while this is under investigation, you can also give passt a bit more memory:

  sysctl -w net.core.rmem_max=$((16 * 1024 * 1024))
  sysctl -w net.core.wmem_max=$((16 * 1024 * 1024))

...especially on the host. Even 4 or 2 megs should "fix" your issue. Fedora uses a 208 KiB default for both.
Comment 10 David Gibson 2023-08-31 05:50:49 UTC
Ok, I've done a bunch of investigation and have a number of observations, although no final conclusion:

* If I force the low_rmem and low_wmem values inside passt to 0, I can't reproduce the problem.  This strongly suggests the problem is with our handling of small socket buffers rather than the fact that the socket buffers are small(ish) per se.

* Forcing just low_rmem or just low_wmem to 0 is insufficient; it has to be both.

* More specifically, I can't reproduce if I remove exactly one effect of each of low_rmem and low_wmem (a generic sketch of the kind of guard involved follows at the end of this comment).  I can't reproduce if I remove just:
     - the test on !low_rmem when setting SO_RCVBUF in tcp_sock_set_bufsize()
AND  - the test on !low_wmem when setting SO_SNDBUF in tap_listen_handler()

* In the original case, the logs (both Matej's and ones I reproduced myself) contain a bunch of messages like:
     236.0487: tap: dropped 35 frames of 39 due to short send
  These indicate that we're dropping frames over the link to qemu.  We don't presently handle that very elegantly: we simply let the frames drop and assume that TCP will figure it out.  The fact that we're dropping frames here explains the "packet not captured" errors in the pcap files.  I initially thought that the dropping of frames was a crucial part of the puzzle, but...

* If I remove the !low_rmem test on the TCP SO_RCVBUF, but *not* the !low_wmem test on the tap (Unix) SO_SNDBUF, then I can still reproduce the problem (or one very like it), but I *don't* get the dropped frames.  So the stalling mechanism doesn't require frames to be dropped.

Still looking...
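
For readers following along, here is a minimal generic sketch of the kind of guard discussed in this comment. It is not passt's actual code: apart from the low_rmem flag and the SO_RCVBUF option named above, the helper names and the threshold are hypothetical.

    /* Illustration only, not passt's code: only force an explicit receive
     * buffer size when the system-wide maximum (net.core.rmem_max) is large;
     * otherwise keep the kernel default, which on this setup ends up around
     * 128 KiB (see comment 12 below).  check_rmem_max(), maybe_set_rcvbuf()
     * and LOW_RMEM_THRESHOLD are made-up names. */
    #include <sys/socket.h>

    #define LOW_RMEM_THRESHOLD (256 * 1024)	/* hypothetical cut-off */

    static int low_rmem;	/* 1 if net.core.rmem_max is considered small */

    static void check_rmem_max(long rmem_max)
    {
            /* decided once at startup, from the sysctl value */
            low_rmem = rmem_max < LOW_RMEM_THRESHOLD;
    }

    static void maybe_set_rcvbuf(int s, int bufsize)
    {
            /* with a small rmem_max, skip the explicit SO_RCVBUF */
            if (!low_rmem)
                    setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                               &bufsize, sizeof(bufsize));
    }

Removing the !low_rmem / !low_wmem tests, as in the experiments above, corresponds to always taking the setsockopt() branch regardless of the configured maximums.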
Comment 11 David Gibson 2023-08-31 05:54:58 UTC
Forgot some additional observations:

 * I can't reproduce if I limit the window used by iperf3 to ~256kiB with the -w option

* I *can* still reproduce if I clamp the window in tcp_clamp_window to 256kiB or 128kiB.
Comment 12 David Gibson 2023-08-31 06:30:52 UTC
I drilled down into what TCP SO_RCVBUF sizes seem to trigger the problem.  It seems to kick in when the RCVBUF size drops below 256kiB, or pretty close to that.

 * With SO_RCVBUF comfortably above 256kiB I get consistently high throughput, ~2-3 Gbytes/s
 * When SO_RCVBUF gets close to 256kiB it sometimes works well, sometimes starts slow (MiB/s rather than GiB/s) then speeds up to 2-3 GiB/s, sometimes stays slow, and sometimes stalls completely
 * As SO_RCVBUF drops below 256kiB, the chances of a good run seem to drop very rapidly.

[All values quoted above are as reported by *getsockopt*, which means setting half the value with setsockopt, because... reasons].

At least on my distro/configuration, not setting SO_RCVBUF at all (as we do with low_rmem == 1) results in SO_RCVBUF == ~128K, well into the problem zone.
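
For reference, the factor of two mentioned above is documented behaviour: the Linux kernel doubles the value passed to setsockopt() for SO_RCVBUF/SO_SNDBUF to allow space for bookkeeping overhead, and getsockopt() returns the doubled value (see socket(7)). A minimal standalone check:

    /* Demonstrates the SO_RCVBUF doubling described in socket(7). */
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
            int s = socket(AF_INET, SOCK_STREAM, 0);
            int requested = 128 * 1024, reported = 0;
            socklen_t len = sizeof(reported);

            if (s < 0)
                    return 1;
            setsockopt(s, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
            getsockopt(s, SOL_SOCKET, SO_RCVBUF, &reported, &len);
            printf("requested %d, getsockopt() reports %d\n",
                   requested, reported);
            /* Typically prints: requested 131072, getsockopt() reports 262144.
             * Requests above net.core.rmem_max are clamped to it before the
             * doubling is applied. */
            return 0;
    }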
Comment 13 David Gibson 2023-08-31 09:56:01 UTC
Yet more experiments and observations:

 * I took a packet trace of a stall from the *host* side.  There the window advertised from the server (that is, from passt acting for the server) abruptly drops from ~64kiB to 112 bytes, then stays there.

 * If I use the -m option to clamp MTU to 64000 bytes, I can no longer reproduce the problem.  I really don't know how this fits in with everything else.
Comment 14 David Gibson 2023-09-01 04:22:33 UTC
Unsurprisingly given the MTU observation from comment 13 - but pretty surprising in isolation - clamping the (tap side) MSS for connections also seems to stop the problem from reproducing.  How much it needs to be clamped seems to be a bit fuzzy: in intermediate ranges it seems to increase the incidence of a case where the transfer starts slow(ish) - Mbps rather than Gbps - but recovers to the expected full speed after 1-2s.
Comment 15 David Gibson 2023-09-01 04:42:34 UTC
Expanding on comment 14:
  * Clamping MSS to 63960 (equivalent to an MTU of 64000), I rarely see a full stall or a consistently slow transfer, although slow starts are moderately common
  * With the unclamped MSS of 65436, I get slow transfers on most attempts and full stalls maybe 30-50% of the time.
  * With MSS clamped to 32000, stalls, slow transfers and slow starts are all rare, although I have still seen at least one stall.
Comment 16 David Gibson 2023-09-13 06:36:42 UTC
Sorry, I was working on bug 68 for a while, but have now gotten back to investigating this.

New observations:

 * I've confirmed that when "stalled" a trickle of data still seems to be getting through:
    - While the iperf3 client shows 0 bytes/s most of the time, the iperf3 server shows a few hundred bytes being transferred
    - If left running long enough, every so often the client shows a few hundred k transferred in an interval, then returns to 0 transfer

 * ss shows a substantial amount of data in the Send-Q of the iperf3 client's sending socket, which seems to reduce slowly over time (see the sketch at the end of this comment for another way to watch this)

 * It looks like some buffer is filling up (though I'm not sure exactly which buffer) between iperf3 -c and passt, then draining very slowly until it has enough room to accept another chunk of data (this suggests some kind of hysteresis in the management of this buffer, wherever it is)

 * In a host-side packet trace of a long, eventually stalling transfer I noticed these patterns, not sure if they're significant:
    - Initially, window full events seem to be quite rare and irregular
    - Some time in (but before the full stall), window full events become regular, around every 0.25s / 44 frames; then at the time of the stall they become super frequent, around every 0.2s / 2 frames.
    - All these "regular" window full events seem to come after a significant delay of ~0.2s following the previous packet.  Not sure what the cause of this delay is.
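
On the Send-Q observation above: ss is what was used here, but as an illustration, a program holding the sending socket can query roughly the same quantity with the SIOCOUTQ ioctl, which for a connected TCP socket returns the number of bytes written but not yet acknowledged by the peer.

    /* Illustration only: roughly what ss reports as Send-Q for a connected
     * TCP socket - bytes queued for sending that the peer has not ACKed. */
    #include <sys/ioctl.h>
    #include <linux/sockios.h>	/* SIOCOUTQ */

    static int send_queue_bytes(int sock_fd)
    {
            int queued = 0;

            if (ioctl(sock_fd, SIOCOUTQ, &queued) < 0)
                    return -1;
            return queued;
    }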
Comment 17 Stefano Brivio 2023-10-10 11:53:08 UTC
Matej, an updated package for Fedora 38 (https://bodhi.fedoraproject.org/updates/FEDORA-2023-b1e79e591e) is available. Can you check if that fixes the issue for you, before we close this?

By the way, the series (now merged) that should fix this is:
  https://archives.passt.top/passt-dev/20230929150446.2671959-1-sbrivio@redhat.com/
Comment 18 Matej Hrica 2023-10-12 12:12:01 UTC
It definitely works, though the throughput jumps around quite a bit; is this expected behavior?

$ iperf3 -t 1000 -c 127.0.0.1      
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 34144 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   328 MBytes  2.75 Gbits/sec    0   1.62 MBytes       
[  5]   1.00-2.00   sec  82.5 MBytes   692 Mbits/sec    0   2.00 MBytes       
[  5]   2.00-3.00   sec   289 MBytes  2.42 Gbits/sec    0   2.31 MBytes       
[  5]   3.00-4.00   sec   206 MBytes  1.73 Gbits/sec    1   2.44 MBytes       
[  5]   4.00-5.00   sec   440 MBytes  3.69 Gbits/sec    1   2.44 MBytes       
[  5]   5.00-6.00   sec   275 MBytes  2.31 Gbits/sec    0   2.44 MBytes       
[  5]   6.00-7.00   sec   525 MBytes  4.40 Gbits/sec    0   2.44 MBytes       
[  5]   7.00-8.00   sec   302 MBytes  2.54 Gbits/sec    0   2.44 MBytes       
[  5]   8.00-9.00   sec   180 MBytes  1.51 Gbits/sec    0   2.44 MBytes       
[  5]   9.00-10.00  sec   102 MBytes   860 Mbits/sec    0   2.44 MBytes       
[  5]  10.00-11.00  sec  52.5 MBytes   440 Mbits/sec    0   2.44 MBytes       
[  5]  11.00-12.00  sec   256 MBytes  2.15 Gbits/sec    0   2.44 MBytes       
[  5]  12.00-13.00  sec   172 MBytes  1.45 Gbits/sec    0   2.44 MBytes       
[  5]  13.00-14.00  sec   170 MBytes  1.43 Gbits/sec    0   2.44 MBytes       
[  5]  14.00-15.00  sec   189 MBytes  1.58 Gbits/sec    0   2.44 MBytes       
[  5]  15.00-16.00  sec   316 MBytes  2.65 Gbits/sec    0   2.44 MBytes       
[  5]  16.00-17.00  sec   114 MBytes   954 Mbits/sec    0   2.44 MBytes


Capping the sending speed of iperf improves the throughput:
$ iperf3 -t 1000 -c 127.0.0.1 -b 5.5G
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 38560 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   655 MBytes  5.50 Gbits/sec    0   1.12 MBytes       
[  5]   1.00-2.00   sec   656 MBytes  5.50 Gbits/sec    0   1.12 MBytes       
[  5]   2.00-3.00   sec   656 MBytes  5.50 Gbits/sec    0   1.50 MBytes       
[  5]   3.00-4.00   sec   656 MBytes  5.50 Gbits/sec    0   1.50 MBytes       
[  5]   4.00-5.00   sec   656 MBytes  5.50 Gbits/sec    0   1.50 MBytes       
[  5]   5.00-6.00   sec   390 MBytes  3.27 Gbits/sec    7   1.19 MBytes       
[  5]   6.00-7.00   sec   139 MBytes  1.16 Gbits/sec    0   1.50 MBytes       
[  5]   7.00-8.00   sec  97.8 MBytes   820 Mbits/sec    9   1.06 MBytes       
[  5]   8.00-9.00   sec   505 MBytes  4.24 Gbits/sec    0   1.06 MBytes       
[  5]   9.00-10.00  sec   589 MBytes  4.94 Gbits/sec    0   1.06 MBytes       
[  5]  10.00-11.00  sec   163 MBytes  1.37 Gbits/sec    0   1.44 MBytes       
[  5]  11.00-12.00  sec   564 MBytes  4.73 Gbits/sec    0   1.44 MBytes       
[  5]  12.00-13.00  sec   502 MBytes  4.21 Gbits/sec    0   1.56 MBytes


These tests are with default socket buffer sizes; both guest and host are Fedora 38.
$ sysctl net.core.rmem_max net.core.rmem_default net.core.wmem_max net.core.wmem_default
net.core.rmem_max = 212992
net.core.rmem_default = 212992
net.core.wmem_max = 212992
net.core.wmem_default = 212992
Comment 19 Matej Hrica 2023-10-12 12:57:40 UTC
Never mind, it seems like a QEMU performance issue.

With my implementation in libkrun, it works very nicely:

$ iperf3 -t 1000 -c 127.0.0.1   
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 52838 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.72 GBytes  14.8 Gbits/sec    0   1.25 MBytes       
[  5]   1.00-2.00   sec  1.75 GBytes  15.0 Gbits/sec    0   1.25 MBytes       
[  5]   2.00-3.00   sec  1.64 GBytes  14.1 Gbits/sec    0   1.25 MBytes       
[  5]   3.00-4.00   sec  1.72 GBytes  14.8 Gbits/sec    0   1.37 MBytes       
[  5]   4.00-5.00   sec  1.57 GBytes  13.5 Gbits/sec    0   1.37 MBytes       
[  5]   5.00-6.00   sec  1.73 GBytes  14.9 Gbits/sec    0   1.37 MBytes       
[  5]   6.00-7.00   sec  1.68 GBytes  14.5 Gbits/sec    0   1.37 MBytes       
[  5]   7.00-8.00   sec  1.09 GBytes  9.35 Gbits/sec    0   1.37 MBytes       
[  5]   8.00-9.00   sec  1.31 GBytes  11.3 Gbits/sec    0   1.37 MBytes       
[  5]   9.00-10.00  sec  1.68 GBytes  14.5 Gbits/sec    0   1.37 MBytes       
[  5]  10.00-11.00  sec  1.68 GBytes  14.4 Gbits/sec    0   1.37 MBytes       
[  5]  11.00-12.00  sec  1.66 GBytes  14.3 Gbits/sec    0   1.37 MBytes       
[  5]  12.00-13.00  sec  1.67 GBytes  14.4 Gbits/sec    0   1.37 MBytes       
[  5]  13.00-14.00  sec  1.70 GBytes  14.6 Gbits/sec    0   1.37 MBytes       
[  5]  14.00-15.00  sec  1.72 GBytes  14.8 Gbits/sec    0   1.37 MBytes
Comment 20 David Gibson 2023-10-13 03:45:21 UTC
(In reply to Matej Hrica from comment #18)
> It definetly works, though the throghput jumps around quite a bit, is this
> expected behavior?

Yes and no.  Although the changes we've made stop the specific mechanism of the stall you were seeing originally, we noticed during testing that there are still some performance oddities related to the buffer sizes.  We're still looking at those and thinking about what we can do - it's pretty hard to pin down because there are rather a lot of variables that affect what's going on in complex ways.

So it's not intended behaviour, but it's kind of expected for now.

(In reply to Matej Hrica from comment #19)
> Never mind, it seems like a QEMU performance issue.

Huh... that's interesting, I never even thought to check that.


In any case, it seems like the original bug is resolved, so I'm closing this ticket.
Comment 21 David Gibson 2023-11-10 00:43:29 UTC
Hi Matej,

We've been debugging a bug very similar to this one over at:
    https://github.com/containers/podman/issues/20170

The conclusion there seems to be that the fix we gave you for this bug wasn't really a fix, but just masked the problem in some cases.  We now have a fix that should be better.  If you want to try it out, it's here:

https://gitlab.com/dgibson/passt/-/tree/noclamp?ref_type=heads
