Created attachment 18 [details]
with the pod and pasta running

Context:
Three pods on arm. Created a tap interface and bound ports to it for inter-pod communication.

podman kube play --network pasta:-l,/tmp/pasterngcho.log,--trace civinfo.yaml
ports:
- containerPort: 80
  hostPort: 4443

This pod uses the other two pods as backends via the tap interface.

The other two (just for clarity):

podman kube play --network pasta civinfo-wiki.yaml
ports:
- containerPort: 80
  hostIP: 172.17.254.1
  hostPort: 8880

podman pod create -p 172.17.254.1:9080:80 --network pasta -n nextpriv

172.17.254.1 is the address on the tap device. The offending pod creates connections to it.

Behavior:
After a while - usually after 24-48 hours; I still have not noticed a pattern - the offending pod stops answering network packets, and its corresponding pasta process takes all the CPU.

Attached logs are from version 2023_06_27.289301b - one is with the pod and pasta running, the other is after the pod is stopped.
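For readability, here is roughly how the port mapping quoted above would sit inside the kube YAML. This is only a sketch: the surrounding spec structure, the container name and the image are assumptions; only the containerPort/hostPort values come from the report.

```yaml
# Hypothetical shape of civinfo.yaml's port section (structure, name and
# image are assumed; the port values are taken from the report above)
apiVersion: v1
kind: Pod
metadata:
  name: ngcho
spec:
  containers:
    - name: ngcho
      image: docker.io/library/nginx:latest
      ports:
        - containerPort: 80
          hostPort: 4443
```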
Created attachment 19 [details]
the second log, when the offending pod is stopped
Hi, sorry it's taken so long to look at this; we've had other high-priority things to investigate.

Unfortunately the logs you've provided don't have any clues. The log file is limited in size, so it only shows a certain time window of events. In what remains, the "working" and "not working" traces look effectively identical.

Let's get some basic information to start with:

1. In the offending pod in the broken state, can you get the output of "ip addr" and "ip route"? We have one other reported bug (https://github.com/containers/podman/issues/19405) where a pod loses its addresses and routes after some time. I don't know if that bug is related to this one, but it might be.

2. Can you replace the '--trace' in the command line with '--debug', then see if you're still able to reproduce? This will reduce the volume of log messages, so we'll see a longer window which might show something more useful.

3. I'm not entirely clear on what you mean by "address on the tap device". Do you mean the address of the external interface as seen from within the offending pod? Is this also an address on the host?
If this is related to the podman issue described (now also recorded as bug 70) we have a draft fix here: https://gitlab.com/dgibson/passt/-/tree/podman19405?ref_type=heads If you're able to try that, that would be helpful.
Thanks for the feedback and suggestions.

- compiled passt from the git source
- changed --trace to --debug
- the tap device is created on the host, with an assigned IPv4 address (172....). As I have a few pods, I expose some ports bound to the tap device address. This way I can allow just the desired connectivity across the pods. Previously I exposed ports on the host IP address, and decided to make the setup more robust.
- will change the watchdog to include in its log the desired IP and routing table information from the pods. (Right now it just logs that a regression is observed and restarts the pods, after killing the offending hung pasta process.)

---- output snips

>>>>> the watchdog log
/tmp/passt$ tail -n 8 /tmp/didirecover
alarm 158671 1 0 14:00 ? 00:00:00 /usr/bin/pasta --config-net -t 4443-4443:80-80 -l /tmp/pasterngcho.log --trace -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-e899c7a5-1c69-3d02-586f-3ffcbd7c01a2
alarm 158728 1 0 14:00 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/8880-8880:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-fc6dd732-4ef1-6304-43f6-f7f5508bfdc7
alarm 158802 1 0 14:00 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/9080-9080:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-1b88ad22-cb44-abc6-b634-32acc985b998
alarm 158957 158378 0 14:00 ? 00:00:00 grep pasta

##### 2023-08-18
compiled pasta from git source (potential fix included)
- debug, instead of trace, in the pasta settings

>>>>> the now used fresh-from-git pasta
[alarm@alarm] /tmp/passt$ pasta --version
pasta 2023_06_27.289301b-47-g5f1fcff
Copyright Red Hat
GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

>>>>> applied changes for the log --debug level
[alarm@alarm] /tmp/passt$ ps -ef|grep pasta
alarm 162634 1 0 17:44 ? 00:00:00 /usr/bin/pasta --config-net -t 4443-4443:80-80 -l /tmp/pasterngcho.log --debug -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-4e82d9ff-9189-7c2e-c0d9-340be9c3e24e
alarm 162697 1 0 17:45 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/8880-8880:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-77de9e1a-ab6b-a740-4434-5d62dbb76864
alarm 162811 1 0 17:45 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/9080-9080:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-3f51df6c-f8da-2bb4-98ae-6d85fd16e1c7
alarm 162948 139167 0 17:49 pts/0 00:00:00 grep pasta
Created attachment 20 [details]
pasta --debug log

The log was copied after the regression happened.
- changed my recover script to copy the log, so next time it will cover a more relevant time window
Created attachment 21 [details]
container network info, when regression is observed

The whole log - the last lines include the recovery after switching to the git-compiled pasta. Included is the raw network info from the container (looks OK to me: expected addresses and routes).

It also contains date marks for when the regression is observed (the check runs every 10 minutes). I did not know how to relate them to the times in the pasta debug log.
Hello,

The regression (no ingress network to the container) is observed with the compiled pasta too.
- The network in the offending container seems good to me (I see the expected IP address and route).
- Uploaded two logs - the debug one from pasta, and the log from the recover script (this one contains the network info for the container in the last lines).

p.s. I already modified the recover scripts - they will make a copy of the pasta debug log as soon as a regression is observed, as I have the feeling that the debug log can be overwritten with normal information after the recovery.
Thanks for the information. It doesn't look like this is related to the podman issue (now also recorded as bug 70). I thought it probably wasn't, but it was worth checking. I'm continuing to look through the logs to see if I can find any clues.
Ok, there are a few things in the logs that look a little suspicious, but nothing I can draw a conclusion from so far. Stefano suspects this may be related to a known race between timers and connection shutdown. That's plausible, but by no means certain.

I'd like to get a clearer picture of the setup here. First, please correct me if any of my understandings below are incorrect:

* You have 3 pods; let's call them A (listening on host port 4443), B (listening on host port 8880) and C (listening on host port 9080)
* All 3 pods are connected to the outside world with pasta
* It's (consistently) pod A which encounters the problem, losing connectivity
* You have some sort of watchdog running on the host which detects when pod A becomes unreachable and resets it, also logging some information (which is what is in recover.log.gz)
* 172.17.254.1 is one of the host addresses

Then, let's get some further details:

1. Can I get the IP address and route information from the host system?
2. Can I get the IP address and route information from each of the 3 pods immediately after they start (i.e. when everything is still working properly)?
3. You mention a tap device several times in comment 0, but I'm not sure exactly which one you mean. Each pasta instance creates a tap device. Is it one of these, or a different tap device on the host you're referring to?
4. On each recovery in recover.log I see what looks like 2 process listings. What's the exact command you're getting output from, and when is each of these listings made during the recovery process?
Hello, thanks for looking at it.

Yes, will provide the IP information (hope the text formatting below persists :)

I am very new to pasta :) and did not realize the potential tap device confusion. It is a tuntap device (Kind=tap) on the host.

>>>>>>>> recover.log (4.)
Sorry - it is very ugly logging; its main use for me was the network info in the pod. However - it stops the pods and checks for pasta processes (which should be gone with the pods, if everything is normal). As one pasta process still exists - that proves to me it is hung. Then it is terminated with pkill. As we should now have a clean state (no pods, no leftover pasta process) - start the pods again, and check that the pasta processes are there (expected is to have 3 with new PIDs). The script is copied at the bottom for sanity :)

(The beginning of the log was filled by an earlier version of the script; I modified the script on the go, to include more relevant-to-me info. In short: the information in the log is not gathered the same way at the beginning and at the end.)

>>>>>>>> ip setup on the host and overview of pods and host tuntap device (1., 2., 3.)
>>> end0 is the physical interface - provides connectivity
>>> localer is a tuntap device (Kind=tap) used just inside the host and pods

outside traffic-----> pod A____-> serves some content from the same pod A
......................|.\
......................|..\_____-> proxy pass to pod B, serves other content
.......................\_______-> proxy pass to pod C, serves another content

[alarm@alarm] ~$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

Pod A
1d7965b47046 localhost/podman-pause:4.5.1-1685940105  4 days ago Up 3 days 0.0.0.0:4443->80/tcp 41ccc54e19bc-infra
2a97a019744a docker.io/library/nginx:latest nginx -g daemon o... 4 days ago Up 3 days 0.0.0.0:4443->80/tcp ngcho-ngcho

Pod B
10a86fe1a3f8 localhost/podman-pause:4.5.1-1685940105 5 weeks ago Up 3 days 172.17.254.1:8880->80/tcp 0aba7d4b6c3c-infra
68bfaa81554d docker.io/library/mariadb:latest mariadbd 5 weeks ago Up 3 days 172.17.254.1:8880->80/tcp civinfo-wiki-mariadb1
46b3e133fc1b docker.io/library/mediawiki:latest apache2-foregroun... 5 weeks ago Up 3 days 172.17.254.1:8880->80/tcp civinfo-wiki-wiki1

Pod C
0864724f1013 localhost/podman-pause:4.5.1-1685940105 8 days ago Up 3 days 172.17.254.1:9080->80/tcp 379308f82e1f-infra
1e7ce6edf3b4 docker.io/library/nextcloud:fpm-alpine php-fpm 8 days ago Up 3 days 172.17.254.1:9080->80/tcp npriv
98c3d0551f00 docker.io/library/nginx:latest nginx -g daemon o... 8 days ago Up 3 days 172.17.254.1:9080->80/tcp ngnr
43fbc5ba7e76 docker.io/library/redis:alpine redis-server 8 days ago Up 3 days 172.17.254.1:9080->80/tcp relock

[alarm@alarm] ~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: localer: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether b6:09:51:d5:e7:ab brd ff:ff:ff:ff:ff:ff
    inet 172.17.254.1/32 scope global localer
       valid_lft forever preferred_lft forever
3: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:ba:60:54:22:77 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba:60ff:fe54:2277/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever

>>>>>> ip in pod A, listening on 0.0.0.0:4443
[alarm@alarm] ~$ podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} 41ccc54e19bc-infra)
[root@alarm] ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 1a:5f:e4:12:0e:85 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::185f:e4ff:fe12:e85/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever

>>>>>> ip in pod B, listening on 172.17.254.1:8880
[alarm@alarm] ~$ podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} 0aba7d4b6c3c-infra)
[root@alarm] ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether ee:d8:91:8c:81:ec brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::ecd8:91ff:fe8c:81ec/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
[root@alarm] ~# ip r
default via 192.168.1.1 dev end0 proto static
192.168.1.0/24 dev end0 proto kernel scope link src 192.168.1.34

>>>>> ip in pod C, listening on 172.17.254.1:9080
[alarm@alarm] ~$ podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} 379308f82e1f-infra)
[root@alarm] ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 06:3d:e5:41:a7:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::43d:e5ff:fe41:a71c/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
[root@alarm] ~# ip r
default via 192.168.1.1 dev end0 proto static
192.168.1.0/24 dev end0 proto kernel scope link src 192.168.1.34

>>>>>>>>>> just for sanity - the recover script
[alarm@alarm] ~$ cat recover.sh
#!/bin/bash
echo recover invoked
echo >> /tmp/didirecover
echo had to recover >> /tmp/didirecover
date >> /tmp/didirecover
echo >> /tmp/didirecover
echo ip from the pod namespace >> /tmp/didirecover
echo fib_trie >> /tmp/didirecover
podman exec ngcho-ngcho cat /proc/net/fib_trie >> /tmp/didirecover
echo route >> /tmp/didirecover
podman exec ngcho-ngcho cat /proc/net/route >> /tmp/didirecover
echo check egress from the container >> /tmp/didirecover
podman exec ngcho-ngcho curl -m 6.0 -sv https://passt.top 2>> /tmp/didirecover
echo will copy the now log to a dedicated file >> /tmp/didirecover
cp /tmp/pasterngcho.log /tmp/pasterengcho-just-noticed-regression.log
podman pod stop ngcho
podman pod stop nextpriv
podman pod stop civinfo-wiki
podman pod stop ngcho
podman pod stop nextpriv
podman pod stop civinfo-wiki
echo >> /tmp/didirecover
echo pasta processes after stoping pods >> /tmp/didirecover
ps -ef |grep pasta >> /tmp/didirecover
pkill pasta
podman pod start ngcho
podman pod start civinfo-wiki
/home/alarm/rconfs/nextpriv/start.me
echo >> /tmp/didirecover
echo pasta processes after recovery >> /tmp/didirecover
ps -ef|grep pasta >> /tmp/didirecover
echo >> /tmp/didirecover
Thanks for the additional information. I think I understand the setup better now:

* Pod A ("ngcho") accepts requests from the outside world on port 4443
* For certain requests pod A needs to make further requests to pods B ("civinfo-wiki") and C ("nextpriv")
* Those onward requests are made to address 172.17.254.1, ports 8880 and 9080

Based on the new information I can make a few preliminary observations.

1) [trivial] You might be able to simplify your recovery script if you use these lines at the beginning:

set -x
exec >> /tmp/didirecover 2>&1

The "set -x" means the shell will echo each command as it's executed, making it clearer which output goes with which command without having to add a bunch of additional echo commands. The exec + redirects sends all the output (stdout and stderr) to the log file, so you don't need explicit redirects on every command.

2) Although the requests from pod A to pods B & C are being sent to address 172.17.254.1, pods B & C will see them as arriving at address 192.168.1.34, since that's their IP address. That's not necessarily a problem - pasta will automatically perform this NAT - but it is a bit counter-intuitive.

3) For the communication between pods (rather than between pods and the outside world), pasta may not be a great solution. pasta's primary aim is to allow pod<->host and thereby pod<->world communication without host privilege; private networks between pods are possible with existing features and shouldn't require host privilege. So, it might be a more natural solution to create a private network between the pods using veth+bridge devices with the 172.17.x.y addresses. The pods can have a secondary network interface based on pasta purely for communication with the outside world. Pod A definitely needs this, B+C may or may not, but only pod A would need port forwarding - requests from pod A to pods B & C would go over the private bridge instead. I'm not familiar enough with the details of network configuration in podman to know how to set that up.

4) There may also be a simpler way to set this up using just pasta interfaces, without the host 'localer' interface. By default podman uses pasta's --no-map-gw option, which means the host's primary IP isn't accessible from the pods; however, that can be overridden. If --map-gw is used, then rather than using a secondary address from the 'localer' interface, pods B & C can be set up to listen on 127.0.0.1, and pod A can reach those services via the address of the host's default gateway.

Unfortunately, none of those observations actually point towards a cause of the original problem here. It's possible that implementing approaches (3) or (4) would stop the problem, but I have no particular reason to suspect they would, and if they did it would just be avoiding the problem, not actually solving it.

I'm also still a bit confused by the role of the 'localer' device. Usually the way a tuntap device works is that it has a network interface on one side and a character device on the other: you use socket calls on the netif side, and read and write raw ethernet or IP frames on the character device side. It's not clear to me what's creating the localer interface here, and what's on the character device side of it. In any case, I don't think any traffic is actually going over the tap interface (from these pods, anyway); its only function in this setup AFAICT is to add an extra IP address to the host, which could equally be done with a dummy interface, or even just by adding an IP to the main existing interface. Again, I'm not sure any of that is relevant to the original problem here, although depending on what is on the chardev side of 'localer' maybe that could be confusing pasta somehow.

Next steps:

a) If you can tell me how the 'localer' interface is created and used on the host side, that might shed some light on things.
b) If you are able to attempt approaches (3) and/or (4) above, it would be useful to see if you still encounter the problem. If you do, it might lead us to a simpler reproducing case; if you don't, it gives some clues as to what might be triggering the problem.
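The logging pattern suggested in observation (1) - echo each command, then redirect everything to the log - can be sanity-checked with a tiny self-contained demo (note it is `set -x`, not `set -e`, that echoes commands; the mktemp log file is just for illustration):

```shell
#!/bin/sh
# Demo of the suggested recovery-script header: 'set -x' makes the shell
# echo each command, and 'exec' sends all further stdout and stderr
# (including the command trace) into the log file.
log=$(mktemp)
(
    exec >>"$log" 2>&1
    set -x
    echo recover invoked
)
# The log now holds both the trace line '+ echo recover invoked'
# and the command's own output 'recover invoked':
grep -c 'recover invoked' "$log"    # prints 2
```

With this header in place, every command in the script and its output land in the log in order, so the explicit `>> /tmp/didirecover` on each line becomes unnecessary.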
Created attachment 24 [details]
pasta debug log moments after regression
Just attached a pasta debug log taken a few moments (max 10 minutes) after another regression.

>>> time when the regression was observed (script runs every 10 minutes)
Wed Aug 30 13:40:17 EEST 2023

>>> the affected pod does not have outbound connectivity
check egress from the container
* Resolving timed out after 6000 milliseconds
* Closing connection 0

>>> the attached log 'pasta debug log moments after regression'
will copy the now log to a dedicated file

>>> the network ns id of the affected pod
pasta processes after stoping pods
alarm 260286 1 0 Aug28 ? 00:06:27 /usr/bin/pasta --config-net -t 4443-4443:80-80 -l /tmp/pasterngcho.log --debug -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-fe6fba3f-4fd0-4b5c-a223-7cde6431741e
Thanks for the new log. I've had a look; unfortunately I can't spot any useful clues in it.

From the other symptoms, I suspect we're getting into an infinite loop "locally" - that is to say, in such a way that we're not returning to the main epoll loop at all. Along with that, I suspect we're simply not sending any more log messages once the problem occurs. Unfortunately, I don't have any good theories on exactly *where* we're getting stuck. It's time for some heavier tools.

I've made a special-purpose branch of passt here: https://gitlab.com/dgibson/passt/-/tree/bu68?ref_type=heads

For now this makes two changes: it changes the default compile options to include debugging symbols, and it prevents this version from clearing the DUMPABLE flag, meaning we can attach to it with debuggers.

Can you please:
1. Clone the git tree from that branch
2. make clean (this is to remove any existing objects built without debug symbols)
3. make (to rebuild with the debug symbols)
4. Install the newly built passt and passt.avx2 binaries to wherever you need so that podman uses them
5. Install gdb on your machine
6. Alter your recovery script so that before killing the stale pasta process, it executes:
   gdb -p $PID /path/to/passt.avx2 --batch -ex 'bt'
   Here $PID is the pid of the running, stalled pasta process, and /path/to/passt.avx2 is the full path to the binary you've built from the debugging branch.

The idea of the above is that we'll get a backtrace of where in the code the stuck pasta process is executing. With a bit of luck that should provide a more concrete clue as to what's going on here.

The extra information (a & b) requested in comment 11 would also be useful.
Sorry, I overlooked and missed a & b in comment 11.

>>> a) created via systemd-networkd:
cat /etc/systemd/network/02-taper.netdev
[NetDev]
Name=localer
Kind=tap

cat /etc/systemd/network/03-localer.network
[Match]
Name=localer

#[Link]
#RequiredForOnline=no

[Network]
#DHCPv6=no
ConfigureWithoutCarrier=true
Address=172.17.254.1/32

>>> b)
Great suggestions; I will definitely dive deep into them, as I tried a very similar concept with the knowledge I had at the time, so this will improve my knowledge. Huge thanks for the pointers, really appreciated. I don't mind using pasta just for pod A, if that will simplify digging into the regression (until I am able to implement the better approach you pointed me to).

>>> bu68 branch
from:
pasta --version
pasta 2023_06_27.289301b-47-g5f1fcff

now in use:
pasta --version
pasta 2023_08_23.a7e4bfb-1-gfedd2b6

p.s. there is some issue with my gdb: it complains about loading shared libraries. Still get some backtraces though.

>> sample bt (while pod is running fine)
0x0000ffff9a85a728 in ?? ()
#0 0x0000ffff9a85a728 in ?? ()
#1 0x0000fffff9fb1410 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
[Inferior 1 (process 279133) detached]

>> gdb in the recover script
no passt.avx2 after building, just the passt one:
file passt
passt: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=d03a642201957c5e8067582ef47ea76759c9b743, for GNU/Linux 3.7.0, with debug_info, not stripped

As I was not able to fix the gdb complaints in time, I will at least gather the gdb logs as the user, as root, and within podman unshare - immediately when the regression is observed, and again after the pod is stopped.
>>> invoking gdb within the script
OFPP=$(pgrep -f "pasta.*debug")
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-sudo-$(date).log"
(In reply to jlesev from comment #15)
> >>> a) created via systemd-network:
> cat /etc/systemd/network/02-taper.netdev
> [NetDev]
> Name=localer
> Kind=tap
>
> cat /etc/systemd/network/03-localer.network
> [Match]
> Name=localer
>
> #[Link]
> #RequiredForOnline=no
>
> [Network]
> #DHCPv6=no
> ConfigureWithoutCarrier=true
> Address=172.17.254.1/32

<reads up on systemd-networkd> Ok... unless there's some other component connecting to the chardev side, I think this is basically just making a disconnected tap device, which is fairly pointless. I believe using 'Kind=dummy' should work, without creating an essentially unused tap device. Alternatively you could change the network configuration on your main interface to add the 172.17 IP, which AFAICT is the only relevant thing this is actually doing.

> >>> b)
> Great suggestions, definitely will dive deep in them, as i tried very
> similar concept with my existing in that time knowledge. So will improve my
> knowledge. Huge Thanks for pointing me, really appreciate. I don't mind to
> use pasta just for the POD A, if that will simplify digging in the
> regression (till be able to implement the better approach, as you pointed me
> in it)

Ok, sounds good, no rush. Simplifying the reproducing situation is certainly valuable though: it gives a better chance to figure out what the crucial details are to reproduce this.

> >>> bu68 branch
> from:
> pasta --version
> pasta 2023_06_27.289301b-47-g5f1fcff
>
> now in use:
> pasta --version
> pasta 2023_08_23.a7e4bfb-1-gfedd2b6
>
> p.s. there is some issue with my gdb complains about loading shared
> libraries.

That's fairly normal - it means we won't be able to resolve backtraces into those libraries, but at this stage I don't think that will be necessary.

> Still get some backtraces though.
> >> sample bt (while pod is running good)
> 0x0000ffff9a85a728 in ?? ()
> #0 0x0000ffff9a85a728 in ?? ()
> #1 0x0000fffff9fb1410 in ?? ()
> Backtrace stopped: not enough registers or memory available to unwind further
> [Inferior 1 (process 279133) detached]

Hmm... the fact that it wasn't able to resolve any symbols in that trace (not even main()) is concerning. With an idle, working pasta we're probably sitting in the epoll_wait() call most of the time, so without symbols for libc or other libraries, main is probably the only one that could be resolved. It's possible some detail of arm means it can't be, but we might still get results from the "real" case, where I expect a deeper call stack. That said, while I don't think we need symbols for every shared library, it might be worth installing 'glibc-debuginfo' so that we can resolve at least the libc symbols.

> >> gdb in the recover script
> no passt.avx2 after building just passt one:

Oh, sorry, I forgot this was arm, not x86_64 - the avx2 variant is x86 specific. Just copying the passt binary is sufficient then.

> file passt
> passt: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV),
> dynamically linked, interpreter /lib/ld-linux-aarch64.so.1,
> BuildID[sha1]=d03a642201957c5e8067582ef47ea76759c9b743, for GNU/Linux 3.7.0,
> with debug_info, not stripped
>
> As was not able timely to fix the gdb complains, at least will gather the
> gdb logs as the user, as the root, within podman unshare.

Ah, yes. Again, I'm not super familiar with all the details of podman, so I guess you need that to be in the right context to debug the pasta process it starts.

> Immediately when
> the regression is observed, and after the pod is stoped.
>
> >>> invoking gdb within the scrpit
> OFPP=$(pgrep -f "pasta.*debug")
> podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-podman-unshre-$(date).log"
> gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
> sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-sudo-$(date).log"

You seem to have OFPP in one place then $FFOO in another - was that just a transcription error into the bug, or do you need to correct your script? I _think_ you only need the podman unshare version, not the others. I guess we'll see what the results are.
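Concretely, the Kind=dummy variant suggested above would be a one-line change to the existing netdev file (an untested sketch; the matching 03-localer.network file can stay exactly as it is):

```ini
# /etc/systemd/network/02-taper.netdev - dummy instead of tap
[NetDev]
Name=localer
Kind=dummy
```

A dummy device carries the 172.17.254.1 address in the same way, but has no character-device side, so nothing is left dangling.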
Hello,
- simplified the setup - just one pasta pod. Changed tap to dummy (thanks). I have a few reasons to avoid --map-gw, as the host itself listens for services not relevant to the pods, for which I'd like to keep just enough possibility of entry. So in short, the same topology, except just one pasta pod and Kind=dummy for the localer interface.
- did not manage to overcome the gdb shared-library issue, so recompiled with "make static"
- fixed the typo for the PID variable

>>> snippets:
grep FFOO ./recover.sh
FFOO=$(pgrep -f "pasta.*debug")
echo offending pod PID is $FFOO >> /tmp/didirecover
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-sudo-$(date).log"
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-$(date).log"

All of them now produced this:

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x000000000046fc38 in epoll_pwait ()
#0 0x000000000046fc38 in epoll_pwait ()
#1 0x000000000040080c in main (argc=<optimized out>, argv=<optimized out>) at passt.c:302
[Inferior 1 (process 5735) detached]
Created attachment 25 [details]
gdb and pasta debug log
Just added gdb and pasta debug logs - attachment: "gdb and pasta debug log".

>>> gathering them, with comments:

># just after catching the regression; the pod is still running
cp /tmp/pasterngcho.log "/tmp/pasterengcho-just-noticed-regression-$(date).log"
FFOO=$(pgrep -f "pasta.*debug")
echo offending pod PID is $FFOO >> /tmp/didirecover
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-sudo-$(date).log"
>## the above gdb backtraces are taken with the pod running

># at this point the pod is already stopped
echo pasta processes after stoping pods >> /tmp/didirecover
ps -ef |grep pasta >> /tmp/didirecover
>>>>### the existence of the pasta process validates that the observation really is the regression (avoids scenarios where it is just a dns or internet issue)
echo will gather gdb logs again, as just the pasta process is here
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-sudo-$(date).log"
>### the above gdb backtraces are taken with the pod stopped and a still-persistent pasta process
Created attachment 26 [details]
gdb-logs-2023-09-04
(In reply to jlesev from comment #17)
> Hello
> - simplified the setup - just one pasta pod.

Ok, great. How are you doing the communication between the pods now? With a bridge device?

> Changed tap to dummy (10x). I
> have few reasons to avoid -map-gaw, as the host itself is listening to non
> relevant for the pods services, for which i'd like to keep just the enough
> possibility to enter.

Sorry, I don't quite understand what you're saying here.

> So in short same topology, despite just one pasta and
> kind dummy for the localer interface
>
> - did not manage to overcome the shared library gdb. So recompiled with
> make static
> - fixed the typo for the PID variable
> >>> snippets:
> grep FFOO ./recover.sh
> FFOO=$(pgrep -f "pasta.*debug")
> echo offending pod PID is $FFOO >> /tmp/didirecover
> podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-podman-unshre-$(date).log"
> gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
> sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-sudo-$(date).log"
> podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-just-pasta-podman-unshre-$(date).log"
> gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-just-pasta-$(date).log"
>
> all of them now produced this:

I'm not sure what you mean by "them" here. Do you mean running each of these gdb variants on an idle/working pasta?

> warning: Target and debugger are in different PID namespaces; thread lists
> and other data are likely unreliable. Connect to gdbserver inside the
> container.
> 0x000000000046fc38 in epoll_pwait ()
> #0 0x000000000046fc38 in epoll_pwait ()
> #1 0x000000000040080c in main (argc=<optimized out>, argv=<optimized out>)
> at passt.c:302
> [Inferior 1 (process 5735) detached]

If this is a working pasta, this is more or less expected - we're waiting in epoll_wait() for the next event to respond to.
Thanks for the new logs with gdb traces. AFAICT all the methods you used to invoke gdb worked just fine - that turns out to be useful, because it's effectively given us several backtraces a few seconds apart, which gives us a bit more information about what's going on. I can't draw a conclusion from the information yet, but it looks like we're getting into an infinite loop within tcp_tap_handler(). My next step is to add some more focused instrumentation there to figure out why.
Ok, after more detailed examination of the gdb traces, my best guess is that we're entering an infinite (or at least, far too long) loop in tap4_handler() around the call to tcp_tap_handler(). It's possible that's caused by some kind of underflow on the packet count. I've added some specific instrumentation to check this, and I've made a new branch including it here: https://gitlab.com/dgibson/passt/-/tree/bug68?ref_type=heads [Note this is a different branch name from previous - I corrected the typo I made last time] Could you please build and install this new pasta and send me the logs after the next failure. This time it's the debug log from pasta, rather than the gdb traces, that's of most interest.
>> pasta
pasta compiled with commit b28e9fc3e49a442893c19e26220335b3fac3d829. Still compiled as a static binary - using "make static" - to be able to provide both kinds of information: pasta debug log and gdb backtraces.

>> pod topology
All three pods are rootless. POD B and POD C are now using the netavark network type. Their ports are exposed just to the host dummy interface. POD A is using pasta. It accesses POD B and POD C via the host dummy interface:port.

If I start the pasta pod with --map-gw - as I understand it - the containers in the pod will be able to access every network service listening on the host, by creating/attempting connections to their default gw. I have run some tests to validate my understanding. Containers being able to access network services listening on the host, especially on the host 127.0.0.1, is something I am not ready for. That's why I am still using the host dummy interface, instead of mapping pod B and pod C to the host 127.0.0.1 and accessing their services within pod A (if started with --map-gw) via the gateway:port.
(In reply to jlesev from comment #24) > >> pasta > pasta compiled with commit b28e9fc3e49a442893c19e26220335b3fac3d829 > Still compiled as static binary - using "make static", to be able to provide > both kinds of information: pasta debug log and gdb backtraces Ok, great. > >> pod topology > All the three pods are rootless. > POD B and POD C are now using the netavark network type. Their ports are > exposed just to the host dummy interface. > POD A is using the passt. It access POD B and POD C via the host dummy > interface:port Ok. I'm not familiar with netavark, but at this point I don't think it really matters. It looks very much like whatever's going wrong is entirely within the pod A pasta, and the network of the other pods doesn't matter that much. > IF i start the pasta pod with -map-gw - As i understood this means the > containers in pod will be able to access every network service listening on > the host, by creating/attempting connection to their default gw. I have run > some test to validate my understanding. Containers - being able to access > network service listening on the host, especially listening on the host > 127.0.0.1 is something i am not ready for. That's why i am still using the > host dummy interface, instead mapping pod B and pod C to the host 127.0.0.1 > and access their service within pod A (if started with -map-gw) via the > gateway:port Ok, that's fair. --map-gw does expose the host's loopback interface to the pod. Do be aware that all IPs on the host except for loopback and the IP that pasta picks for the pod are exposed to the container, even with --no-map-gw. This includes IPs that would not effectively be accessible externally, like the one on the dummy interface (and in fact you're relying on this). So what protection is afforded by --no-map-gw is fairly fragile.
Created attachment 27 [details] pasta debug logs, with TCP Sequence debug statements
Created attachment 28 [details] gdb backtraces from the regression with pasta debug TCP Sequence statements
Added 2 attachments:
gdb backtraces from the regression with pasta debug TCP Sequence statements
pasta debug logs, with TCP Sequence debug statements

The regression was observed twice, once on "Mon" and once on "Tue". The pasta debug attachment contains the 2 pasta debug logs copied at the time. The gdb backtraces attachment contains the 2 sets of 6 gdb backtraces - 3 from the moment when the pod and pasta are running, 3 from the moment when the pod is stopped and pasta is still running.

10x for clarifying --(no-)map-gw, as well as for your cooperation and professionalism.
Thanks for the new logs. They have confirmed that a) yes, we are definitely looping in the place I thought and b) the reason is an underflow of the counter of packets. I'm now looking at the code to see why that might happen. I think we're pretty close to pinning this down now. Not sure if I'll be able to figure it out from here, or if we'll need another round of specially instrumented pasta.
Small update. I've looked at the code and discussed with Stefano. We've spotted a number of bugs there, although I'm not yet certain exactly which combination of them is causing the symptoms you're seeing. Stefano and I are working out how to tackle the bugs, and in the meantime I hope to send another instrumented pasta version to check more precisely how we're getting into the situation you're seeing. I'm moderately confident that fixing the bugs we've seen will fix the problem here, but it would be good to really understand exactly how the symptoms you've described arose. If nothing else that might let us construct a minimal regression test case.
10x for your efforts. I am willing to help dig down to the root cause, and will happily run tests with pasta versions.
Ok, what I hope will be the last round of instrumentation to pin down the root cause is ready. You can download it at https://gitlab.com/dgibson/passt/-/tree/bug68?ref_type=heads Note that the behaviour of this version should be a bit different. Instead of entering a (near) infinite loop when the problem occurs, it should detect the problem and exit instead. The point of this is that we don't flush out the debugging showing the origin of the problem with thousands of messages as we loop. With the revised version the debug logs should finish with the relevant logging of how the problem originates.
I've now posted draft fixes for a number of problems in tcp_tap_handler(), which I think will fix this bug amongst others. You can get them at: https://gitlab.com/dgibson/passt/-/tree/bug68c?ref_type=heads However, if you can keep running the previous version for a bit to get the extra instrumentation output, that would still be helpful. As soon as you get a reproduction with that version, go ahead and try the proposed fix and see if it works.
Hello,
Now using: 9536f3d bug68: root cause debugging

I understood that this instrumented pasta will exit - to keep the debug log clean. However, just in case, I still compiled it with make static, so as to be able to provide gdb backtraces even if pasta does not exit as expected.
(In reply to jlesev from comment #34) > Hello, > Now using: 9536f3d bug68: root cause debugging > > I understood, that this instrumental passta will exit - to keep the debug > log clean. However, just in case, i still compiled it with make static, so > to be able to provide gdb backtraces even if the pasta does not exit as > expected. Ok. I don't think that matters, but it certainly won't hurt.
Created attachment 29 [details] pasta-debug log with the root cause bug68 branch
Thanks for the information, I think I've pieced together the scenario that was causing the problem.

1. In the batch of packets before the one with the underflow, we received a FIN, setting the TAP_FIN_RCVD event in tcp_data_from_tap(). We are now waiting for the sock->tap side of the connection to close (the CLOSE_WAIT TCP state, w.r.t. the tap side).
2. We haven't received a close from the socket side, so conn_event_do(TAP_FIN_RCVD) also sets the ACTIVE_CLOSE flag.
3. Upon return from tcp_data_from_tap() we shutdown() the write side of the socket, and set the SOCK_FIN_SENT event in tcp_tap_handler().
4. We finish with the penultimate batch of packets.
5. The socket side completes closing, causing a POLLRDHUP or POLLHUP epoll event, which causes the SOCK_FIN_RCVD event to be set on the connection in tcp_tap_sock_handler(). We are now waiting for the tap side to ack any last data we sent (the LAST-ACK state, w.r.t. the tap side).
6. We receive the final batch of packets from the tap interface. The first packet in the batch is the final ack we're expecting, then there is at least one additional packet. The extra packet could be a SYN for a new connection, it could be something out of order, or it could be due to a bug on the guest/container side - in any case it's beyond pasta's control.
7. In tcp_tap_handler() we process the final ack packet. We enter the TAP_FIN_RCVD subpath, since we already set that event. Since we also have SOCK_FIN_RCVD and this *is* the final ack, we close the connection.
8. Calling conn_event(.., CLOSED) clears all other events in conn_event_do().
9. We now return 1 from tcp_tap_handler(). Because we haven't consumed all the packets in the batch, tcp_tap_handler() is called again. In this loop we now have n == (p->count - 1).
10. Due to a bug fixed in the series I've sent, rather than moving on to the next packet, tcp_tap_handler() examines the final ack packet again.
11. Due to another bug fixed in the series I've sent, tcp_tap_handler() matches the addresses and finds the connection, even though it's closed.
12. Because the FIN events were cleared when we CLOSED the connection, we don't enter the TAP_FIN_RCVD subpath this time, and instead proceed on to tcp_data_from_tap(). We return p->count from tcp_tap_handler().
13. In the caller we subtract the return value from n: n was previously p->count - 1, we subtract p->count, and now have n == -1 == 2^64 - 1. We now loop essentially forever trying to reduce this to 0 again. (Except in the latest instrumented version, which detects this situation and aborts instead.)
Ok, having identified the course of events which led to the bug, I'm even more confident that the fixes we've already merged should fix it. In fact it should fix it in at least 3 different ways:

a) Rather than double processing the last ack, we should move on to the additional packets in the batch, which might start a new connection or trigger a TCP reset (fixed by commit 043a70b885 "tcp, tap: Correctly advance through packets in tcp_tap_handler()").

b) Once we've fully closed the connection we won't consider any new packets as belonging to it, so instead they'll be processed either as the start of a new connection, or as stray packets which will trigger a reset (fixed by commit 5fb376de6e "tcp: Never hash match closed connections").

c) Even if this or some other sequence of packets causes tcp_tap_handler() to incorrectly return a value greater than the number of remaining packets in the batch, this will just cause the loop in tap.c to exit, rather than underflowing and looping forever (also fixed by commit 043a70b885 "tcp, tap: Correctly advance through packets in tcp_tap_handler()").

jlesev, can you please try the latest git master branch. I'm pretty sure the bug will be fixed, but it would be great to confirm it.
Sure. Now using:

commit 05627dc5127bee9f4df77351575572d6f4ce4c7e
Author: Stefano Brivio <sbrivio@redhat.com>
Date: Fri Sep 8 17:34:27 2023 +0200

pasta --version
pasta 2023_09_08.05627dc

I will be able to provide logs if something goes wrong. I have not managed to trigger the regression in a repeatable and timely manner, so I will try to put more network load on the services and will share status every other day. I am open to suggestions too :)
Right, as always with slow to reproduce problems it's hard to be sure it's really fixed, rather than just you got lucky/unlucky this time. Given the previous frequency of the problem occurring, I think we could be reasonably confident once it's run for a week without problems.
No regression observed so far.

About the network load:
- continuous: nmap scripts (looped) target the service with http requests / methods
- a few times: apache ab loads with various (16 - 256) simultaneous connections
- a few times: heavy nmap spider / crawler runs
- a few times: various online web page performance testers/analyzers

I am happily open to advice on any specific, heavy test to stress the service.
Those loading approaches seem reasonable to me, I don't have any additional ideas.
Still going strong, no regression observed.
(In reply to jlesev from comment #43) > Still going strong, no regression observed. Good to hear. Looking promising.
Still going OK
Assuming it's still going OK, I think we can now be pretty confident we've squashed the bug. If you confirm, I'll close this BZ.
Still going OK. I am confident the bug is squashed. You are amazing and doing a great and valuable job. 10x
Thanks :). Closing this bug.