Created attachment 18 [details]
with the pod and pasta running

Context:
Three pods on arm. Created a tap interface and bound ports to it for inter-pod communication.

podman kube play --network pasta:-l,/tmp/pasterngcho.log,--trace civinfo.yaml
ports:
- containerPort: 80
  hostPort: 4443

This pod uses the other two pods as backends via the tap interface.

The other two (just for clarity):

podman kube play --network pasta civinfo-wiki.yaml
ports:
- containerPort: 80
  hostIP: 172.17.254.1
  hostPort: 8880

podman pod create -p 172.17.254.1:9080:80 --network pasta -n nextpriv

172.17.254.1 is the address on the tap device. The offending pod creates connections to it.

Behavior:
After a while - usually after 24-48 hours; I still have not noticed a pattern - the offending pod stops answering network packets, and its corresponding pasta process takes all the CPU.

Attached logs are from version 2023_06_27.289301b - one is with the pod and pasta running, the other is after the pod is stopped.
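For readability, here is roughly how the port mapping quoted above would sit inside the kube YAML. This is only a sketch: the surrounding spec structure, the container name and the image are assumptions; only the containerPort/hostPort values come from the report.

```yaml
# Hypothetical shape of civinfo.yaml's port section (structure, name and
# image are assumed; the port values are taken from the report above)
apiVersion: v1
kind: Pod
metadata:
  name: ngcho
spec:
  containers:
    - name: ngcho
      image: docker.io/library/nginx:latest
      ports:
        - containerPort: 80
          hostPort: 4443
```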
Created attachment 19 [details]
the second log, when the offending pod is stopped
Hi, sorry it's taken so long to look at this; we've had other high-priority things to investigate.

Unfortunately the logs you've provided don't have any clues. The log file is limited in size, so it only shows a certain time window of events. In what remains, the "working" and "not working" traces look effectively identical.

Let's get some basic information to start with:

1. In the offending pod in the broken state, can you get the output of "ip addr" and "ip route"? We have one other reported bug (https://github.com/containers/podman/issues/19405) where a pod loses its addresses and routes after some time. I don't know if that bug is related to this one, but it might be.

2. Can you replace the '--trace' in the command line with '--debug', then see if you're still able to reproduce? This will reduce the volume of log messages, so we'll see a longer window which might show something more useful.

3. I'm not entirely clear on what you mean by "address on the tap device". Do you mean the address of the external interface as seen from within the offending pod? Is this also an address on the host?
If this is related to the podman issue described (now also recorded as bug 70) we have a draft fix here: https://gitlab.com/dgibson/passt/-/tree/podman19405?ref_type=heads If you're able to try that, that would be helpful.
Thanks for the feedback and suggestions.

- compiled passt from the git source
- changed --trace to --debug
- the tap device is created on the host, with an assigned IPv4 address (172....). As I have a few pods, I expose some ports bound to the tap device address. This way I can allow just the desired connectivity across the pods. Previously I exposed ports on the host IP address, and decided to make the setup more robust.
- will change the watchdog to include in its log the desired IP and routing table information from the pods. (Right now it just logs that a regression is observed and restarts the pods, after killing the offending hung pasta process.)

---- output snips

>>>>> the watchdog log
/tmp/passt$ tail -n 8 /tmp/didirecover
alarm 158671 1 0 14:00 ? 00:00:00 /usr/bin/pasta --config-net -t 4443-4443:80-80 -l /tmp/pasterngcho.log --trace -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-e899c7a5-1c69-3d02-586f-3ffcbd7c01a2
alarm 158728 1 0 14:00 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/8880-8880:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-fc6dd732-4ef1-6304-43f6-f7f5508bfdc7
alarm 158802 1 0 14:00 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/9080-9080:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-1b88ad22-cb44-abc6-b634-32acc985b998
alarm 158957 158378 0 14:00 ? 00:00:00 grep pasta

##### 2023-08-18
compiled pasta from git source (potential fix included)
- debug, instead of trace, in the pasta settings

>>>>> the now used fresh-from-git pasta
[alarm@alarm] /tmp/passt$ pasta --version
pasta 2023_06_27.289301b-47-g5f1fcff
Copyright Red Hat
GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

>>>>> applied changes for the log --debug level
[alarm@alarm] /tmp/passt$ ps -ef|grep pasta
alarm 162634 1 0 17:44 ? 00:00:00 /usr/bin/pasta --config-net -t 4443-4443:80-80 -l /tmp/pasterngcho.log --debug -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-4e82d9ff-9189-7c2e-c0d9-340be9c3e24e
alarm 162697 1 0 17:45 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/8880-8880:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-77de9e1a-ab6b-a740-4434-5d62dbb76864
alarm 162811 1 0 17:45 ? 00:00:00 /usr/bin/pasta --config-net -t 172.17.254.1/9080-9080:80-80 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-3f51df6c-f8da-2bb4-98ae-6d85fd16e1c7
alarm 162948 139167 0 17:49 pts/0 00:00:00 grep pasta
Created attachment 20 [details]
pasta --debug log

The log was copied after the regression happened.
- changed my recover script to copy the log, so next time it will cover a more relevant time window
Created attachment 21 [details]
container network info, when regression is observed

The whole log - the last lines include the recovery after switching to the git-compiled pasta. Included is the raw network info from the container (looks OK to me: expected addresses and routes).

It also contains date marks for when the regression is observed (the check runs every 10 minutes). I did not know how to relate them to the times in the pasta debug log.
Hello,

The regression (no ingress network to the container) is observed with the compiled pasta too.
- The network in the offending container seems good to me (I see the expected IP address and route).
- Uploaded two logs - the debug one from pasta, and the log from the recover script (this one contains the network info for the container in the last lines).

p.s. I already modified the recover scripts - they will make a copy of the pasta debug log as soon as a regression is observed, as I have the feeling that the debug log can be overwritten with normal information after the recovery.
Thanks for the information. It doesn't look like this is related to the podman issue (now also recorded as bug 70). I thought it probably wasn't, but it was worth checking. I'm continuing to look through the logs to see if I can find any clues.
Ok, there are a few things in the logs that look a little suspicious, but nothing I can draw a conclusion from so far. Stefano suspects this may be related to a known race between timers and connection shutdown. That's plausible, but by no means certain.

I'd like to get a clearer picture of the setup here. First, please correct me if any of my understandings below are incorrect:

* You have 3 pods; let's call them A (listening on host port 4443), B (listening on host port 8880) and C (listening on host port 9080)
* All 3 pods are connected to the outside world with pasta
* It's (consistently) pod A which encounters the problem, losing connectivity
* You have some sort of watchdog running on the host which detects when pod A becomes unreachable and resets it, also logging some information (which is what is in recover.log.gz)
* 172.17.254.1 is one of the host addresses

Then, let's get some further details:

1. Can I get the IP address and route information from the host system?
2. Can I get the IP address and route information from each of the 3 pods immediately after they start (i.e. when everything is still working properly)?
3. You mention a tap device several times in comment 0, but I'm not sure exactly which one you mean. Each pasta instance creates a tap device. Is it one of these, or a different tap device on the host you're referring to?
4. On each recovery in recover.log I see what looks like 2 process listings. What's the exact command you're getting output from, and when is each of these listings made during the recovery process?
Hello, thanks for looking at it.

Yes, will provide the IP information (hope the text formatting below persists :)

I am very new to pasta :) and did not realize the potential tap device confusion. It is a tuntap device (Kind=tap) on the host.

>>>>>>>> recover.log (4.)
Sorry - it is very ugly logging; its main use for me was the network info in the pod. However - it stops the pods and checks for pasta processes (which should be gone with the pods, if everything is normal). As one pasta process still exists - that proves to me it is hung. Then it is terminated with pkill. As we should now have a clean state (no pods, no leftover pasta process) - start the pods again, and check that the pasta processes are there (expected is to have 3 with new PIDs). The script is copied at the bottom for sanity :)

(The beginning of the log was filled by an earlier version of the script; I modified the script on the go, to include more relevant-to-me info. In short: the information in the log is not gathered the same way at the beginning and at the end.)

>>>>>>>> ip setup on the host and overview of pods and host tuntap device (1., 2., 3.)
>>> end0 is the physical interface - provides connectivity
>>> localer is a tuntap device (Kind=tap) used just inside the host and pods

outside traffic-----> pod A____-> serves some content from the same pod A
......................|.\
......................|..\_____-> proxy pass to pod B, serves other content
.......................\_______-> proxy pass to pod C, serves another content

[alarm@alarm] ~$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

Pod A
1d7965b47046 localhost/podman-pause:4.5.1-1685940105  4 days ago Up 3 days 0.0.0.0:4443->80/tcp 41ccc54e19bc-infra
2a97a019744a docker.io/library/nginx:latest nginx -g daemon o... 4 days ago Up 3 days 0.0.0.0:4443->80/tcp ngcho-ngcho

Pod B
10a86fe1a3f8 localhost/podman-pause:4.5.1-1685940105 5 weeks ago Up 3 days 172.17.254.1:8880->80/tcp 0aba7d4b6c3c-infra
68bfaa81554d docker.io/library/mariadb:latest mariadbd 5 weeks ago Up 3 days 172.17.254.1:8880->80/tcp civinfo-wiki-mariadb1
46b3e133fc1b docker.io/library/mediawiki:latest apache2-foregroun... 5 weeks ago Up 3 days 172.17.254.1:8880->80/tcp civinfo-wiki-wiki1

Pod C
0864724f1013 localhost/podman-pause:4.5.1-1685940105 8 days ago Up 3 days 172.17.254.1:9080->80/tcp 379308f82e1f-infra
1e7ce6edf3b4 docker.io/library/nextcloud:fpm-alpine php-fpm 8 days ago Up 3 days 172.17.254.1:9080->80/tcp npriv
98c3d0551f00 docker.io/library/nginx:latest nginx -g daemon o... 8 days ago Up 3 days 172.17.254.1:9080->80/tcp ngnr
43fbc5ba7e76 docker.io/library/redis:alpine redis-server 8 days ago Up 3 days 172.17.254.1:9080->80/tcp relock

[alarm@alarm] ~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: localer: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether b6:09:51:d5:e7:ab brd ff:ff:ff:ff:ff:ff
    inet 172.17.254.1/32 scope global localer
       valid_lft forever preferred_lft forever
3: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:ba:60:54:22:77 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba:60ff:fe54:2277/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever

>>>>>> ip in pod A, listening on 0.0.0.0:4443
[alarm@alarm] ~$ podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} 41ccc54e19bc-infra)
[root@alarm] ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 1a:5f:e4:12:0e:85 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::185f:e4ff:fe12:e85/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever

>>>>>> ip in pod B, listening on 172.17.254.1:8880
[alarm@alarm] ~$ podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} 0aba7d4b6c3c-infra)
[root@alarm] ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether ee:d8:91:8c:81:ec brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::ecd8:91ff:fe8c:81ec/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
[root@alarm] ~# ip r
default via 192.168.1.1 dev end0 proto static
192.168.1.0/24 dev end0 proto kernel scope link src 192.168.1.34

>>>>> ip in pod C, listening on 172.17.254.1:9080
[alarm@alarm] ~$ podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} 379308f82e1f-infra)
[root@alarm] ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 06:3d:e5:41:a7:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.34/24 brd 192.168.1.255 scope global end0
       valid_lft forever preferred_lft forever
    inet6 fe80::43d:e5ff:fe41:a71c/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
[root@alarm] ~# ip r
default via 192.168.1.1 dev end0 proto static
192.168.1.0/24 dev end0 proto kernel scope link src 192.168.1.34

>>>>>>>>>> just for sanity - the recover script
[alarm@alarm] ~$ cat recover.sh
#!/bin/bash
echo recover invoked
echo >> /tmp/didirecover
echo had to recover >> /tmp/didirecover
date >> /tmp/didirecover
echo >> /tmp/didirecover
echo ip from the pod namespace >> /tmp/didirecover
echo fib_trie >> /tmp/didirecover
podman exec ngcho-ngcho cat /proc/net/fib_trie >> /tmp/didirecover
echo route >> /tmp/didirecover
podman exec ngcho-ngcho cat /proc/net/route >> /tmp/didirecover
echo check egress from the container >> /tmp/didirecover
podman exec ngcho-ngcho curl -m 6.0 -sv https://passt.top 2>> /tmp/didirecover
echo will copy the now log to a dedicated file >> /tmp/didirecover
cp /tmp/pasterngcho.log /tmp/pasterengcho-just-noticed-regression.log
podman pod stop ngcho
podman pod stop nextpriv
podman pod stop civinfo-wiki
podman pod stop ngcho
podman pod stop nextpriv
podman pod stop civinfo-wiki
echo >> /tmp/didirecover
echo pasta processes after stoping pods >> /tmp/didirecover
ps -ef |grep pasta >> /tmp/didirecover
pkill pasta
podman pod start ngcho
podman pod start civinfo-wiki
/home/alarm/rconfs/nextpriv/start.me
echo >> /tmp/didirecover
echo pasta processes after recovery >> /tmp/didirecover
ps -ef|grep pasta >> /tmp/didirecover
echo >> /tmp/didirecover
Thanks for the additional information. I think I understand the setup better now:

* Pod A ("ngcho") accepts requests from the outside world on port 4443
* For certain requests pod A needs to make further requests to pods B ("civinfo-wiki") and C ("nextpriv")
* Those onward requests are made to address 172.17.254.1, ports 8880 and 9080

Based on the new information I can make a few preliminary observations.

1) [trivial] You might be able to simplify your recovery script if you use these lines at the beginning:

set -x
exec >> /tmp/didirecover 2>&1

The "set -x" means the shell will echo each command as it's executed, making it clearer which output goes with which command without having to add a bunch of additional echo commands. The exec + redirects sends all the output (stdout and stderr) to the log file, so you don't need explicit redirects on every command.

2) Although the requests from pod A to pods B & C are being sent to address 172.17.254.1, pods B & C will see them as arriving at address 192.168.1.34, since that's their IP address. That's not necessarily a problem - pasta will automatically perform this NAT - but it is a bit counter-intuitive.

3) For the communication between pods (rather than between pods and the outside world), pasta may not be a great solution. pasta's primary aim is to allow pod<->host and thereby pod<->world communication without host privilege; private networks between pods are possible with existing features and shouldn't require host privilege. So, it might be a more natural solution to create a private network between the pods using veth+bridge devices with the 172.17.x.y addresses. The pods can have a secondary network interface based on pasta purely for communication with the outside world. Pod A definitely needs this, B+C may or may not, but only pod A would need port forwarding - requests from pod A to pods B & C would go over the private bridge instead. I'm not familiar enough with the details of network configuration in podman to know how to set that up.

4) There may also be a simpler way to set this up using just pasta interfaces, without the host 'localer' interface. By default podman uses pasta's --no-map-gw option, which means the host's primary IP isn't accessible from the pods; however, that can be overridden. If --map-gw is used, then rather than using a secondary address from the 'localer' interface, pods B & C can be set up to listen on 127.0.0.1, and pod A can reach those services via the address of the host's default gateway.

Unfortunately, none of those observations actually point towards a cause of the original problem here. It's possible that implementing approaches (3) or (4) would stop the problem, but I have no particular reason to suspect they would, and if they did it would just be avoiding the problem, not actually solving it.

I'm also still a bit confused by the role of the 'localer' device. Usually the way a tuntap device works is that it has a network interface on one side and a character device on the other: you use socket calls on the netif side, and read and write raw ethernet or IP frames on the character device side. It's not clear to me what's creating the localer interface here, and what's on the character device side of it. In any case, I don't think any traffic is actually going over the tap interface (from these pods, anyway); its only function in this setup AFAICT is to add an extra IP address to the host, which could equally be done with a dummy interface, or even just by adding an IP to the main existing interface. Again, I'm not sure any of that is relevant to the original problem here, although depending on what is on the chardev side of 'localer' maybe that could be confusing pasta somehow.

Next steps:

a) If you can tell me how the 'localer' interface is created and used on the host side, that might shed some light on things.
b) If you are able to attempt approaches (3) and/or (4) above, it would be useful to see if you still encounter the problem. If you do, it might lead us to a simpler reproducing case; if you don't, it gives some clues as to what might be triggering the problem.
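The logging pattern suggested in observation (1) - echo each command, then redirect everything to the log - can be sanity-checked with a tiny self-contained demo (note it is `set -x`, not `set -e`, that echoes commands; the mktemp log file is just for illustration):

```shell
#!/bin/sh
# Demo of the suggested recovery-script header: 'set -x' makes the shell
# echo each command, and 'exec' sends all further stdout and stderr
# (including the command trace) into the log file.
log=$(mktemp)
(
    exec >>"$log" 2>&1
    set -x
    echo recover invoked
)
# The log now holds both the trace line '+ echo recover invoked'
# and the command's own output 'recover invoked':
grep -c 'recover invoked' "$log"    # prints 2
```

With this header in place, every command in the script and its output land in the log in order, so the explicit `>> /tmp/didirecover` on each line becomes unnecessary.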
Created attachment 24 [details]
pasta debug log moments after regression
Just attached a pasta debug log taken a few moments (max 10 minutes) after another regression.

>>> time when the regression was observed (script runs every 10 minutes)
Wed Aug 30 13:40:17 EEST 2023

>>> the affected pod does not have outbound connectivity
check egress from the container
* Resolving timed out after 6000 milliseconds
* Closing connection 0

>>> the attached log 'pasta debug log moments after regression'
will copy the now log to a dedicated file

>>> the network ns id of the affected pod
pasta processes after stoping pods
alarm 260286 1 0 Aug28 ? 00:06:27 /usr/bin/pasta --config-net -t 4443-4443:80-80 -l /tmp/pasterngcho.log --debug -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-fe6fba3f-4fd0-4b5c-a223-7cde6431741e
Thanks for the new log. I've had a look; unfortunately I can't spot any useful clues in it.

From the other symptoms, I suspect we're getting into an infinite loop "locally" - that is to say, in such a way that we're not returning to the main epoll loop at all. Along with that, I suspect we're simply not sending any more log messages once the problem occurs. Unfortunately, I don't have any good theories on exactly *where* we're getting stuck. It's time for some heavier tools.

I've made a special-purpose branch of passt here: https://gitlab.com/dgibson/passt/-/tree/bu68?ref_type=heads

For now this makes two changes: it changes the default compile options to include debugging symbols, and it prevents this version from clearing the DUMPABLE flag, meaning we can attach to it with debuggers.

Can you please:
1. Clone the git tree from that branch
2. make clean (this is to remove any existing objects built without debug symbols)
3. make (to rebuild with the debug symbols)
4. Install the newly built passt and passt.avx2 binaries to wherever you need so that podman uses them
5. Install gdb on your machine
6. Alter your recovery script so that before killing the stale pasta process, it executes:
   gdb -p $PID /path/to/passt.avx2 --batch -ex 'bt'
   Here $PID is the pid of the running, stalled pasta process, and /path/to/passt.avx2 is the full path to the binary you've built from the debugging branch.

The idea of the above is that we'll get a backtrace of where in the code the stuck pasta process is executing. With a bit of luck that should provide a more concrete clue as to what's going on here.

The extra information (a & b) requested in comment 11 would also be useful.
Sorry, I overlooked and missed a & b in comment 11.

>>> a) created via systemd-networkd:
cat /etc/systemd/network/02-taper.netdev
[NetDev]
Name=localer
Kind=tap

cat /etc/systemd/network/03-localer.network
[Match]
Name=localer

#[Link]
#RequiredForOnline=no

[Network]
#DHCPv6=no
ConfigureWithoutCarrier=true
Address=172.17.254.1/32

>>> b)
Great suggestions; I will definitely dive deep into them, as I tried a very similar concept with the knowledge I had at the time, so this will improve my knowledge. Huge thanks for the pointers, really appreciated. I don't mind using pasta just for pod A, if that will simplify digging into the regression (until I am able to implement the better approach you pointed me to).

>>> bu68 branch
from:
pasta --version
pasta 2023_06_27.289301b-47-g5f1fcff

now in use:
pasta --version
pasta 2023_08_23.a7e4bfb-1-gfedd2b6

p.s. there is some issue with my gdb: it complains about loading shared libraries. Still get some backtraces though.

>> sample bt (while pod is running fine)
0x0000ffff9a85a728 in ?? ()
#0 0x0000ffff9a85a728 in ?? ()
#1 0x0000fffff9fb1410 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
[Inferior 1 (process 279133) detached]

>> gdb in the recover script
no passt.avx2 after building, just the passt one:
file passt
passt: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=d03a642201957c5e8067582ef47ea76759c9b743, for GNU/Linux 3.7.0, with debug_info, not stripped

As I was not able to fix the gdb complaints in time, I will at least gather the gdb logs as the user, as root, and within podman unshare - immediately when the regression is observed, and again after the pod is stopped.
>>> invoking gdb within the script
OFPP=$(pgrep -f "pasta.*debug")
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-sudo-$(date).log"
(In reply to jlesev from comment #15)
> >>> a) created via systemd-network:
> cat /etc/systemd/network/02-taper.netdev
> [NetDev]
> Name=localer
> Kind=tap
>
> cat /etc/systemd/network/03-localer.network
> [Match]
> Name=localer
>
> #[Link]
> #RequiredForOnline=no
>
> [Network]
> #DHCPv6=no
> ConfigureWithoutCarrier=true
> Address=172.17.254.1/32

<reads up on systemd-networkd> Ok... unless there's some other component connecting to the chardev side, I think this is basically just making a disconnected tap device, which is fairly pointless. I believe using 'Kind=dummy' should work, without creating an essentially unused tap device. Alternatively you could change the network configuration on your main interface to add the 172.17 IP, which AFAICT is the only relevant thing this is actually doing.

> >>> b)
> Great suggestions, definitely will dive deep in them, as i tried very
> similar concept with my existing in that time knowledge. So will improve my
> knowledge. Huge Thanks for pointing me, really appreciate. I don't mind to
> use pasta just for the POD A, if that will simplify digging in the
> regression (till be able to implement the better approach, as you pointed me
> in it)

Ok, sounds good, no rush. Simplifying the reproducing situation is certainly valuable though: it gives a better chance to figure out what the crucial details are to reproduce this.

> >>> bu68 branch
> from:
> pasta --version
> pasta 2023_06_27.289301b-47-g5f1fcff
>
> now in use:
> pasta --version
> pasta 2023_08_23.a7e4bfb-1-gfedd2b6
>
> p.s. there is some issue with my gdb complains about loading shared
> libraries.

That's fairly normal - it means we won't be able to resolve backtraces into those libraries, but at this stage I don't think that will be necessary.

> Still get some backtraces though.
> >> sample bt (while pod is running good)
> 0x0000ffff9a85a728 in ?? ()
> #0 0x0000ffff9a85a728 in ?? ()
> #1 0x0000fffff9fb1410 in ?? ()
> Backtrace stopped: not enough registers or memory available to unwind further
> [Inferior 1 (process 279133) detached]

Hmm... the fact that it wasn't able to resolve any symbols in that trace (not even main()) is concerning. With an idle, working pasta we're probably sitting in the epoll_wait() call most of the time, so without symbols for libc or other libraries, main is probably the only one that could be resolved. It's possible some detail of arm means it can't be, but we might still get results from the "real" case, where I expect a deeper call stack. That said, while I don't think we need symbols for every shared library, it might be worth installing 'glibc-debuginfo' so that we can resolve at least the libc symbols.

> >> gdb in the recover script
> no passt.avx2 after building just passt one:

Oh, sorry, I forgot this was arm, not x86_64 - the avx2 variant is x86 specific. Just copying the passt binary is sufficient then.

> file passt
> passt: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV),
> dynamically linked, interpreter /lib/ld-linux-aarch64.so.1,
> BuildID[sha1]=d03a642201957c5e8067582ef47ea76759c9b743, for GNU/Linux 3.7.0,
> with debug_info, not stripped
>
> As was not able timely to fix the gdb complains, at least will gather the
> gdb logs as the user, as the root, within podman unshare.

Ah, yes. Again, I'm not super familiar with all the details of podman, so I guess you need that to be in the right context to debug the pasta process it starts.

> Immediately when
> the regression is observed, and after the pod is stoped.
>
> >>> invoking gdb within the scrpit
> OFPP=$(pgrep -f "pasta.*debug")
> podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-podman-unshre-$(date).log"
> gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
> sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-sudo-$(date).log"

You seem to have OFPP in one place then $FFOO in another - was that just a transcription error into the bug, or do you need to correct your script? I _think_ you only need the podman unshare version, not the others. I guess we'll see what the results are.
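Concretely, the Kind=dummy variant suggested above would be a one-line change to the existing netdev file (an untested sketch; the matching 03-localer.network file can stay exactly as it is):

```ini
# /etc/systemd/network/02-taper.netdev - dummy instead of tap
[NetDev]
Name=localer
Kind=dummy
```

A dummy device carries the 172.17.254.1 address in the same way, but has no character-device side, so nothing is left dangling.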
Hello,
- simplified the setup - just one pasta pod. Changed tap to dummy (thanks). I have a few reasons to avoid --map-gw, as the host itself listens for services not relevant to the pods, for which I'd like to keep just enough possibility of entry. So in short, the same topology, except just one pasta pod and Kind=dummy for the localer interface.
- did not manage to overcome the gdb shared-library issue, so recompiled with "make static"
- fixed the typo for the PID variable

>>> snippets:
grep FFOO ./recover.sh
FFOO=$(pgrep -f "pasta.*debug")
echo offending pod PID is $FFOO >> /tmp/didirecover
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-sudo-$(date).log"
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-$(date).log"

All of them now produced this:

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x000000000046fc38 in epoll_pwait ()
#0 0x000000000046fc38 in epoll_pwait ()
#1 0x000000000040080c in main (argc=<optimized out>, argv=<optimized out>) at passt.c:302
[Inferior 1 (process 5735) detached]
Created attachment 25 [details]
gdb and pasta debug log
Just added gdb and pasta debug logs - attachment: "gdb and pasta debug log".

>>> gathering them, with comments:

># just after catching the regression; the pod is still running
cp /tmp/pasterngcho.log "/tmp/pasterengcho-just-noticed-regression-$(date).log"
FFOO=$(pgrep -f "pasta.*debug")
echo offending pod PID is $FFOO >> /tmp/didirecover
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-sudo-$(date).log"
>## the above gdb backtraces are taken with the pod running

># at this point the pod is already stopped
echo pasta processes after stoping pods >> /tmp/didirecover
ps -ef |grep pasta >> /tmp/didirecover
>>>>### the existence of the pasta process validates that the observation really is the regression (avoids scenarios where it is just a dns or internet issue)
echo will gather gdb logs again, as just the pasta process is here
podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-podman-unshre-$(date).log"
gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-$(date).log"
sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-just-pasta-sudo-$(date).log"
>### the above gdb backtraces are taken with the pod stopped and a still-persistent pasta process
Created attachment 26 [details]
gdb-logs-2023-09-04
(In reply to jlesev from comment #17)
> Hello
> - simplified the setup - just one pasta pod.

Ok, great. How are you doing the communication between the pods now? With a bridge device?

> Changed tap to dummy (10x). I
> have few reasons to avoid -map-gaw, as the host itself is listening to non
> relevant for the pods services, for which i'd like to keep just the enough
> possibility to enter.

Sorry, I don't quite understand what you're saying here.

> So in short same topology, despite just one pasta and
> kind dummy for the localer interface
>
> - did not manage to overcome the shared library gdb. So recompiled with
> make static
> - fixed the typo for the PID variable
> >>> snippets:
> grep FFOO ./recover.sh
> FFOO=$(pgrep -f "pasta.*debug")
> echo offending pod PID is $FFOO >> /tmp/didirecover
> podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-podman-unshre-$(date).log"
> gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >> "/tmp/gdb-$(date).log"
> sudo gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-sudo-$(date).log"
> podman unshare gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-just-pasta-podman-unshre-$(date).log"
> gdb -p $FFOO /usr/bin/passt --batch -ex 'bt' >>
> "/tmp/gdb-just-pasta-$(date).log"
>
> all of them now produced this:

I'm not sure what you mean by "them" here. Do you mean running each of these gdb variants on an idle/working pasta?

> warning: Target and debugger are in different PID namespaces; thread lists
> and other data are likely unreliable. Connect to gdbserver inside the
> container.
> 0x000000000046fc38 in epoll_pwait ()
> #0 0x000000000046fc38 in epoll_pwait ()
> #1 0x000000000040080c in main (argc=<optimized out>, argv=<optimized out>)
> at passt.c:302
> [Inferior 1 (process 5735) detached]

If this is a working pasta, this is more or less expected - we're waiting in epoll_wait() for the next event to respond to.
Thanks for the new logs with gdb traces. AFAICT all the methods you used to invoke gdb worked just fine - that turns out to be useful, because it's effectively given us several backtraces a few seconds apart, which gives us a bit more information about what's going on. I can't draw a conclusion from the information yet, but it looks like we're getting into an infinite loop within tcp_tap_handler(). My next step is to add some more focused instrumentation there to figure out why.
Ok, after more detailed examination of the gdb traces, my best guess is that we're entering an infinite (or at least, far too long) loop in tap4_handler() around the call to tcp_tap_handler(). It's possible that's caused by some kind of underflow on the packet count. I've added some specific instrumentation to check this, and I've made a new branch including it here: https://gitlab.com/dgibson/passt/-/tree/bug68?ref_type=heads [Note this is a different branch name from previous - I corrected the typo I made last time] Could you please build and install this new pasta and send me the logs after the next failure. This time it's the debug log from pasta, rather than the gdb traces, that's of most interest.
>> pasta
pasta compiled with commit b28e9fc3e49a442893c19e26220335b3fac3d829. Still compiled as a static binary - using "make static" - to be able to provide both kinds of information: pasta debug log and gdb backtraces.

>> pod topology
All three pods are rootless. POD B and POD C are now using the netavark network type. Their ports are exposed just to the host dummy interface. POD A is using pasta. It accesses POD B and POD C via the host dummy interface:port.

If I start the pasta pod with --map-gw - as I understand it - the containers in the pod will be able to access every network service listening on the host, by creating/attempting connections to their default gw. I have run some tests to validate my understanding. Containers being able to access network services listening on the host, especially on the host 127.0.0.1, is something I am not ready for. That's why I am still using the host dummy interface, instead of mapping pod B and pod C to the host 127.0.0.1 and accessing their services within pod A (if started with --map-gw) via the gateway:port.
(In reply to jlesev from comment #24) > >> pasta > pasta compiled with commit b28e9fc3e49a442893c19e26220335b3fac3d829 > Still compiled as static binary - using "make static", to be able to provide > both kinds of information: pasta debug log and gdb backtraces Ok, great. > >> pod topology > All the three pods are rootless. > POD B and POD C are now using the netavark network type. Their ports are > exposed just to the host dummy interface. > POD A is using the passt. It access POD B and POD C via the host dummy > interface:port Ok. I'm not familiar with netavark, but at this point I don't think it really matters. It looks very much like whatever's going wrong is entirely within the pod A pasta, and the network of the other pods doesn't matter that much. > IF i start the pasta pod with -map-gw - As i understood this means the > containers in pod will be able to access every network service listening on > the host, by creating/attempting connection to their default gw. I have run > some test to validate my understanding. Containers - being able to access > network service listening on the host, especially listening on the host > 127.0.0.1 is something i am not ready for. That's why i am still using the > host dummy interface, instead mapping pod B and pod C to the host 127.0.0.1 > and access their service within pod A (if started with -map-gw) via the > gateway:port Ok, that's fair. --map-gw does expose the host's loopback interface to the pod. Do be aware that all IPs on the host except for loopback and the IP that pasta picks for the pod are exposed to the container, even with --no-map-gw. This includes IPs that would not effectively be accessible externally, like the one on the dummy interface (and in fact you're relying on this). So what protection is afforded by --no-map-gw is fairly fragile.
Created attachment 27 [details] pasta debug logs, with TCP Sequence debug statements
Created attachment 28 [details] gdb backtraces from the regression with pasta debug TCP Sequence statements
Added 2 attachments:
gdb backtraces from the regression with pasta debug TCP Sequence statements
pasta debug logs, with TCP Sequence debug statements

The regression was observed twice, once on "Mon" and once on "Tue". The pasta debug attachment contains the 2 pasta debug logs copied at the time. The gdb backtraces attachment contains the 2 sets of 6 gdb backtraces - 3 from the moment when the pod and pasta are running, 3 from the moment when the pod is stopped and pasta is still running.

10x for clarifying --(no-)map-gw, as well as for your cooperation and professionalism.
Thanks for the new logs. They have confirmed that a) yes, we are definitely looping in the place I thought and b) the reason is an underflow of the counter of packets. I'm now looking at the code to see why that might happen. I think we're pretty close to pinning this down now. Not sure if I'll be able to figure it out from here, or if we'll need another round of specially instrumented pasta.
Small update. I've looked at the code and discussed with Stefano. We've spotted a number of bugs there, although I'm not yet certain exactly which combination of them is causing the symptoms you're seeing. Stefano and I are working out how to tackle the bugs, and in the meantime I hope to send another instrumented pasta version to check more precisely how we're getting into the situation you're seeing. I'm moderately confident that fixing the bugs we've seen will fix the problem here, but it would be good to really understand exactly how the symptoms you've described arose. If nothing else that might let us construct a minimal regression test case.
10x for your efforts. I am willing to help dig down to the root cause, and will happily run tests with pasta versions.
Ok, what I hope will be the last round of instrumentation to pin down the root cause is ready. You can download it at https://gitlab.com/dgibson/passt/-/tree/bug68?ref_type=heads Note that the behaviour of this version should be a bit different. Instead of entering a (near) infinite loop when the problem occurs, it should detect the problem and exit instead. The point of this is that we don't flush out the debugging showing the origin of the problem with thousands of messages as we loop. With the revised version the debug logs should finish with the relevant logging of how the problem originates.
I've now posted draft fixes for a number of problems in tcp_tap_handler(), which I think will fix this bug amongst others. You can get them at: https://gitlab.com/dgibson/passt/-/tree/bug68c?ref_type=heads However, if you can keep running the previous version for a bit to get the extra instrumentation output, that would still be helpful. As soon as you get a reproduction with that version, go ahead and try the proposed fix and see if it works.
Hello,
Now using: 9536f3d bug68: root cause debugging

I understood that this instrumented pasta will exit - to keep the debug log clean. However, just in case, I still compiled it with make static, so as to be able to provide gdb backtraces even if pasta does not exit as expected.
(In reply to jlesev from comment #34) > Hello, > Now using: 9536f3d bug68: root cause debugging > > I understood, that this instrumental passta will exit - to keep the debug > log clean. However, just in case, i still compiled it with make static, so > to be able to provide gdb backtraces even if the pasta does not exit as > expected. Ok. I don't think that matters, but it certainly won't hurt.
Created attachment 29 [details] pasta-debug log with the root cause bug68 branch
Thanks for the information, I think I've pieced together the scenario that was causing the problem.

1. In the batch of packets before the one with the underflow, we received a FIN, setting the TAP_FIN_RCVD event in tcp_data_from_tap(). We are now waiting for the sock->tap side of the connection to close (the CLOSE_WAIT TCP state, w.r.t. the tap side).
2. We haven't received a close from the socket side, so conn_event_do(TAP_FIN_RCVD) also sets the ACTIVE_CLOSE flag.
3. Upon return from tcp_data_from_tap() we shutdown() the write side of the socket, and set the SOCK_FIN_SENT event in tcp_tap_handler().
4. We finish with the penultimate batch of packets.
5. The socket side completes closing, causing a POLLRDHUP or POLLHUP epoll event, which causes the SOCK_FIN_RCVD event to be set on the connection in tcp_tap_sock_handler(). We are now waiting for the tap side to ack any last data we sent (the LAST-ACK state, w.r.t. the tap side).
6. We receive the final batch of packets from the tap interface. The first packet in the batch is the final ack we're expecting, then there is at least one additional packet. The extra packet could be a SYN for a new connection, it could be something out of order, or it could be due to a bug on the guest/container side - in any case it's beyond pasta's control.
7. In tcp_tap_handler() we process the final ack packet. We enter the TAP_FIN_RCVD subpath, since we already set that event. Since we also have SOCK_FIN_RCVD and this *is* the final ack, we close the connection.
8. Calling conn_event(.., CLOSED) clears all other events in conn_event_do().
9. We now return 1 from tcp_tap_handler(). Because we haven't consumed all the packets in the batch, tcp_tap_handler() is called again. In this loop we now have n == (p->count - 1).
10. Due to a bug fixed in the series I've sent, rather than moving on to the next packet, tcp_tap_handler() examines the final ack packet again.
11. Due to another bug fixed in the series I've sent, tcp_tap_handler() matches the addresses and finds the connection, even though it's closed.
12. Because the FIN events were cleared when we CLOSED the connection, we don't enter the TAP_FIN_RCVD subpath this time, and instead proceed on to tcp_data_from_tap(). We return p->count from tcp_tap_handler().
13. In the caller we subtract the return value from n: n was previously p->count - 1, we subtract p->count, and now have n == -1 == 2^64 - 1. We now loop essentially forever trying to reduce this to 0 again. (Except in the latest instrumented version, which detects this situation and aborts instead.)
Ok, having identified the course of events which led to the bug, I'm even more confident that the fixes we've already merged should fix it. In fact it should fix it in at least 3 different ways:

a) Rather than double processing the last ack, we should move on to the additional packets in the batch, which might start a new connection or trigger a TCP reset (fixed by commit 043a70b885 "tcp, tap: Correctly advance through packets in tcp_tap_handler()").

b) Once we've fully closed the connection we won't consider any new packets as belonging to it, so instead they'll be processed either as the start of a new connection, or as stray packets which will trigger a reset (fixed by commit 5fb376de6e "tcp: Never hash match closed connections").

c) Even if this or some other sequence of packets causes tcp_tap_handler() to incorrectly return a value greater than the number of remaining packets in the batch, this will just cause the loop in tap.c to exit, rather than underflowing and looping forever (also fixed by commit 043a70b885 "tcp, tap: Correctly advance through packets in tcp_tap_handler()").

jlesev, can you please try the latest git master branch. I'm pretty sure the bug will be fixed, but it would be great to confirm it.
Sure. Now using:

commit 05627dc5127bee9f4df77351575572d6f4ce4c7e
Author: Stefano Brivio <sbrivio@redhat.com>
Date: Fri Sep 8 17:34:27 2023 +0200

pasta --version
pasta 2023_09_08.05627dc

I will be able to provide logs if something goes wrong. I have not managed to trigger the regression in a repeatable and timely manner, so I will try to put more network load on the services and will share status every other day. I am open to suggestions too :)
Right, as always with slow to reproduce problems it's hard to be sure it's really fixed, rather than just you got lucky/unlucky this time. Given the previous frequency of the problem occurring, I think we could be reasonably confident once it's run for a week without problems.
No regression observed so far.

About the network load:
- continuous: nmap scripts (looped) target the service with http requests / methods
- a few times: apache ab loads with various (16 - 256) simultaneous connections
- a few times: heavy nmap spider / crawler runs
- a few times: various online web page performance testers/analyzers

I am happily open to advice on any specific, heavy test to stress the service.
Those loading approaches seem reasonable to me, I don't have any additional ideas.
Still going strong, no regression observed.
(In reply to jlesev from comment #43) > Still going strong, no regression observed. Good to hear. Looking promising.
Still going OK
Assuming it's still going OK, I think we can now be pretty confident we've squashed the bug. If you confirm, I'll close this BZ.
Still going OK. I am confident the bug is squashed. You are amazing and doing a great and valuable job. 10x
Thanks :). Closing this bug.