79 – pasta does not honour the --outbound flag for udp flows

Bug 79 - pasta does not honour the --outbound flag for udp flows

Summary: pasta does not honour the --outbound flag for udp flows

Status:	RESOLVED FIXED

Alias:	None

Product:	passt
Classification:	Unclassified
Component:	UDP (show other bugs)
Version:	unspecified
Hardware:	All Linux

Importance:	Normal normal
Assignee:	nobody

URL:

Depends on:
Blocks:

Reported:	2024-01-29 18:58 UTC by Laurent Jacquot
Modified:	2024-02-14 13:17 UTC (History)
CC List:	2 users (show)

See Also:

Attachments
capture of flows in the container (2.35 KB, application/octet-stream) 2024-02-03 12:45 UTC, Laurent Jacquot	Details
log from pasta --trace (5.37 KB, text/plain) 2024-02-03 12:46 UTC, Laurent Jacquot	Details
Fix typo in udp_sock_init() (2.88 KB, patch) 2024-02-12 18:41 UTC, Laurent Jacquot	Details \| Diff
Add an attachment (proposed patch, testcase, etc.)

Description Laurent Jacquot 2024-01-29 18:58:59 UTC

Hello
Pasta replies to udp connection with wrong ip address with ipv4 and ipv6. It works for tcp

I set up a pihole rootless container replying to a specific address (::14 for ipv6). Ipv4 is the same

the end-to-end connection is:
-any client sends a dns request to ::14/53
-nftable dnat ::14/53 to ::14/5300
-pasta gets the connection and give it to pihole (port 53)

ps -ef |grep pasta
pod       305095    2433  0 13:26 ?        00:00:00 /usr/bin/pasta --config-net -t 2a01:e0a:1f1:9700::14/5300-5300:53-53 -t 2a01:e0a:1f1:9700::14/8080-8080:80-80 -u 2a01:e0a:1f1:9700::14/5300-5300:53-53 -a 2a01:e0a:1f1:9700::f14 -n 2a01:e0a:1f1:9700::/64 -o 2a01:e0a:1f1:9700::14 -T none -U none --no-map-gw --netns /run/user/960/netns/netns-248aac3a-e84e-2f5b-3831-6e0dc2c9d027

relevant host ip config:
5: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8a:6a:b2:c5:17:39 brd ff:ff:ff:ff:ff:ff
     inet6 2a01:e0a:1f1:9700::4/64 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 2a01:e0a:1f1:9700::13/128 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 2a01:e0a:1f1:9700::14/128 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::1085:d326:7882:f671/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

container ip config
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 2a:36:3b:af:19:ad brd ff:ff:ff:ff:ff:ff
    inet6 2a01:e0a:1f1:9700::f14/64 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 fe80::2836:3bff:feaf:19ad/64 scope link 
       valid_lft forever preferred_lft forever

nftables relevant extract:
table ip6 ip6nat {
...
    chain prerouting {
        type nat hook prerouting priority -90; policy accept;
        ip6 daddr $PIHOLE6 tcp dport 80 dnat to $PIHOLE6:8080
        ip6 daddr $PIHOLE6 udp dport 53 dnat to PIHOLE6:5300
        ip6 daddr $PIHOLE6 tcp dport 53 dnat to $PIHOLE6:5300

    }
}

with tcp, everything works:
dig A free.fr @2a01:e0a:1f1:9700::14 +vc

; <<>> DiG 9.18.21 <<>> A free.fr @2a01:e0a:1f1:9700::14 +vc
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59945
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;free.fr.			IN	A

;; ANSWER SECTION:
free.fr.		3533	IN	A	212.27.48.10

;; Query time: 12 msec
;; SERVER: 2a01:e0a:1f1:9700::14#53(2a01:e0a:1f1:9700::14) (TCP)
;; WHEN: Mon Jan 29 19:41:42 CET 2024
;; MSG SIZE  rcvd: 52

but for udp pasta choose the wrong source ip address as nft monitor shows:

trace id f770f1b9 ip6 ip6nat prerouting rule ip6 saddr 2a01:e0a:1f1:9700:6109:3acb:d24:1621 meta nftrace set 1 (verdict continue)
trace id f770f1b9 ip6 ip6nat prerouting rule ip6 daddr 2a01:e0a:1f1:9700::14 udp dport 53 dnat to [2a01:e0a:1f1:9700::14]:5300 (verdict accept)
trace id f770f1b9 ip6 ip6nat in packet: iif "br0" ... ip6 saddr 2a01:e0a:1f1:9700:6109:3acb:d24:1621 ip6 daddr 2a01:e0a:1f1:9700::4 ip6 ... udp sport 59087 udp dport 5300 udp .. 
ttrace id f770f1b9 ip6 ip6nat in policy accept 
trace id 40c959d9 ip6 ip6nat out packet: oif "br0" ip6 saddr 2a01:e0a:1f1:9700::13 ip6 daddr 2a01:e0a:1f1:9700:6109:3acb:d24:1621 ... udp sport 5300 udp dport 59087 udp ...
==> wrong ip source and wrong source port

Comment 1 Stefano Brivio 2024-01-30 11:06:30 UTC

Laurent, thanks for reporting this.

I couldn't spot any obvious issue in the code yet, but I still have to check if binding UDP sockets for source address selection works like it does for TCP. I just took it for granted to be honest, but now I start having some doubts.

Comment 2 Stefano Brivio 2024-01-30 15:37:12 UTC

Binding to a specific outbound address for UDP actually works for me. With:

  $ pasta -o 2a01:4f8:222:904::2 --config-net
  # echo abcd | nc -u ${SERVER} 9199

on $SERVER:

  recvfrom(3, "abcd\n", 131072, MSG_PEEK, {sa_family=AF_INET6, sin6_port=htons(52692), inet_pton(AF_INET6, "2a01:4f8:222:904::2", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 5

and with:

  $ pasta -o 2a01:4f8:222:904::3 --config-net
  # echo abcd | nc -u ${SERVER} 9199

on $SERVER:

recvfrom(3, "abcd\n", 131072, MSG_PEEK, {sa_family=AF_INET6, sin6_port=htons(43305), inet_pton(AF_INET6, "2a01:4f8:222:904::3", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 5

Could you please try a test similar to this on your box, so that we can exclude any issue with the basic feature itself (in combination with e.g. a specific kernel version)?

Next, it would be nice to have the output of 'strace -e bind -p $(pidof pasta)' (or similar), started before the UDP request is forwarded (no need for the whole output from the start of the process). Note that pasta(1) can't be ptrace(2)d from any unprivileged user, you'll need to run strace(1) as root.

Otherwise, I can try something similar to your setup, but that will take a bit longer on my side.

Comment 3 Laurent Jacquot 2024-01-31 18:52:01 UTC

Hello,
the test for an outbound initiated connection works for me:
(I have some EPERM thought)

alex@jack:~$pasta -o 2a01:e0a:1f1:9700::14 --config-net
Couldn't mount /proc: Permission denied
basename: opérande manquant
root@jack:~$echo abcd | nc -u  2a01:e0c:1::1 53


and on the wire:
root@jack:~#tcpdump -n -i br0 host 2a01:e0c:1::1
19:45:34.477575 IP6 2a01:e0a:1f1:9700::14.34976 > 2a01:e0c:1::1.domain: domain [length 5 < 12] (invalid)
19:45:34.479590 IP6 2a01:e0c:1::1 > 2a01:e0a:1f1:9700::14: ICMP6, destination unreachable, unreachable port, 2a01:e0c:1::1 udp port domain, length 61

I'll try to strace as soon as I can.
thanks!

Comment 4 Laurent Jacquot 2024-01-31 19:03:37 UTC

BTW this works also from the incriminated container 

pod@jack:~$podman exec -ti pihole bash
root@pihole:/# echo abcd | nc -u  2a01:e0c:1::1 53

tcpdump -n -i br0 host 2a01:e0c:1::1
20:01:41.854911 IP6 2a01:e0a:1f1:9700::14.39511 > 2a01:e0c:1::1.domain: domain [length 5 < 12] (invalid)
20:01:41.857467 IP6 2a01:e0c:1::1 > 2a01:e0a:1f1:9700::14: ICMP6, destination unreachable, unreachable port, 2a01:e0c:1::1 udp port domain, length 61

Comment 5 Stefano Brivio 2024-02-01 23:42:49 UTC

(In reply to Laurent Jacquot from comment #3)
> Hello,
> the test for an outbound initiated connection works for me:
> (I have some EPERM thought)
> 
> alex@jack:~$pasta -o 2a01:e0a:1f1:9700::14 --config-net
> Couldn't mount /proc: Permission denied
Uh-oh. It might be a LSM policy issue such as one I recently fixed for AppArmor (https://passt.top/passt/commit?id=abf5ef6c22d2e6fce0f0abe398a2c18b70ca6290). Which distribution is this?

(In reply to Laurent Jacquot from comment #4)
> BTW this works also from the incriminated container 
> 
> pod@jack:~$podman exec -ti pihole bash
> root@pihole:/# echo abcd | nc -u  2a01:e0c:1::1 53
> 
> tcpdump -n -i br0 host 2a01:e0c:1::1
> 20:01:41.854911 IP6 2a01:e0a:1f1:9700::14.39511 > 2a01:e0c:1::1.domain:
> domain [length 5 < 12] (invalid)
The plot thickens -- what's the difference between this and the actual DNS request? Yes, at this point I hope strace of bind() can tell us something more.

Comment 6 Laurent Jacquot 2024-02-02 12:53:33 UTC

Hello Stefano
thank you for your quick responses!

In fact, everything worked, including remote dns sollicitation.

But i rebooted and now the problem is back. This seems random: the ip choosen is now ::4

pod@jack:~$podman exec -ti pihole bash
root@pihole:/# dig A free.fr @2a01:e0c:1:1599::22

and on the wire:

tcpdump -n -i br0 host 2a01:e0c:1:1599::22
13:37:17.121015 IP6 2a01:e0a:1f1:9700::4.37932 > 2a01:e0c:1:1599::22.domain: 4441+ [1au] A? free.fr. (48)
13:37:17.123268 IP6 2a01:e0c:1:1599::22.domain > 2a01:e0a:1f1:9700::4.37932: 4441*- 1/0/1 A 212.27.48.10 (52)


but strace -p shows the right binding

poll_wait(4, [], 8, 1000)              = 0
epoll_wait(4, [{events=EPOLLIN, data={u32=3848, u64=3848}}], 8, 1000) = 1
read(15, "\230\267\205\0)\303\252\201\213Jc\220\206\335`\v\244S\08\21@*\1\16\n\1\361\227\0\0\0"..., 8388608) = 110
read(15, 0x55ef4b4e606e, 8388498)       = -1 EAGAIN (Ressource temporairement non disponible)
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 37
setsockopt(37, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(37, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(37, {sa_family=AF_INET6, sin6_port=htons(37932), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0a:1f1:9700::14", &sin6_addr), sin6_scope_id=0}, 28) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 37, {events=EPOLLIN, data={u32=9476, u64=288393292851193092}}) = 0
sendmmsg(37, [{msg_hdr={msg_name={sa_family=AF_INET6, sin6_port=htons(53), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0c:1:1599::22", &sin6_addr), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\21Y\1 \0\1\0\0\0\0\0\1\4free\2fr\0\0\1\0\1\0\0)\20\0\0\0"..., iov_len=48}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=48}], 1, MSG_NOSIGNAL) = 1
epoll_wait(4, [{events=EPOLLIN, data={u32=9476, u64=288393292851193092}}], 8, 1000) = 1
recvmmsg(37, [{msg_hdr={msg_name={sa_family=AF_INET6, sin6_port=htons(53), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0c:1:1599::22", &sin6_addr), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\21Y\205\0\0\1\0\1\0\0\0\1\4free\2fr\0\0\1\0\1\300\f\0\1\0\1\0"..., iov_len=65487}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=52}], 1, 0, NULL) = 1
write(15, "\252\201\213Jc\220\230\267\205\0)\303\206\335`\0\0\0\0<\21\377*\1\16\f\0\1\25\231\0\0"..., 114) = 114
epoll_wait(4, [], 8, 1000)              = 0


This is weird, I do not have any dnat / masquerade from the container to the world


when using a remote client here is the strace, it shows EPERM when trying to sendmmsg

epoll_wait(4, [{events=EPOLLIN, data={u32=4356, u64=504684860875477252}}], 8, 1000) = 1
recvmmsg(17, [{msg_hdr={msg_name={sa_family=AF_INET6, sin6_port=htons(52783), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78", &sin6_addr), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\0\23\1\0\0\1\0\0\0\0\0\0\4free\2fr\5lutty\3net\0\0"..., iov_len=65487}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}], 1, 0, NULL) = 1
write(15, "\252\201\213Jc\220\230\267\205\0)\303\206\335`\0\0\0\0+\21\377*\1\16\n\1\361\227\0$["..., 97) = 97
epoll_wait(4, [{events=EPOLLIN, data={u32=3848, u64=3848}}], 8, 1000) = 1
read(15, "\230\267\205\0)\303\252\201\213Jc\220\206\335`\3\23K\0+\21@*\1\16\n\1\361\227\0\0\0"..., 8388608) = 97
read(15, 0x55ef4b4e6061, 8388511)       = -1 EAGAIN (Ressource temporairement non disponible)
sendmmsg(17, [{msg_hdr={msg_name={sa_family=AF_INET6, sin6_port=htons(52783), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78", &sin6_addr), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\0\23\201\203\0\1\0\0\0\0\0\0\4free\2fr\5lutty\3net\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}}], 1, MSG_NOSIGNAL) = -1 EPERM (Opération non permise)
epoll_wait(4, [], 8, 1000)              = 0


My setup is the latest fedora fully updated, whith selinux enabled. The container is started by a dedicated user (pod) using podman and started by systemd (systemctl --user start pihole)

the cmd line is:

/usr/bin/podman run \
        --cidfile=%t/%n.ctr-id \
        --cgroups=no-conmon \
        --rm \
        --sdnotify=conmon \
        -d \
        --replace \
        --name=pihole \
        --hostname=pihole \
        --net=pasta:-a,192.168.23.14,-n,255.255.255.0,-a,2a01:e0a:1f1:9700::f14,-n,2a01:e0a:1f1:9700::/64,-o,192.168.22.14,-o,2a01:e0a:1f1:9700::14 \
        --dns=127.0.0.1 \
        --dns=192.168.22.4 \
        --dns=2a01:e0a:1f1:9700::4 \
        --dns-search=lutty.net \
        -v pihole_pihole:/etc/pihole:Z \
        -v pihole_dnsmasq:/etc/dnsmasq.d:Z \
        --label "io.containers.autoupdate=registry" \
        -p 192.168.22.14:8080:80/tcp \
        -p 192.168.22.14:5300:53/tcp \
        -p 192.168.22.14:5300:53/udp \
        -p [2a01:e0a:1f1:9700::14]:8080:80/tcp \
        -p [2a01:e0a:1f1:9700::14]:5300:53/tcp \
        -p [2a01:e0a:1f1:9700::14]:5300:53/udp \
        docker.io/pihole/pihole

Comment 7 Laurent Jacquot 2024-02-02 15:14:47 UTC

This is strange, sendmmsg should not be able to emit an EPERM error:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html

Maybe completely irrelevant, but people from the quic-go project work around it by a retry:
https://github.com/quic-go/quic-go/pull/4111

Comment 8 Stefano Brivio 2024-02-02 15:41:38 UTC

(In reply to Laurent Jacquot from comment #7)
> This is strange, sendmmsg should not be able to emit an EPERM error:
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html
> 
> Maybe completely irrelevant, but people from the quic-go project work around
> it by a retry:
> https://github.com/quic-go/quic-go/pull/4111
EPERM on sendmsg() is usually given (yes, I know, it's not specified, but it makes somewhat sense) because netfilter would drop the packet -- as if the kernel reported that the process is not allowed to send a given packet. See nft_do_chain() in net/netfilter/nf_tables_core.c (not the only path I guess):

        switch (regs.verdict.code & NF_VERDICT_MASK) {
        case NF_ACCEPT:
        case NF_QUEUE:
        case NF_STOLEN:
                return regs.verdict.code;
        case NF_DROP:
                return NF_DROP_REASON(pkt->skb, SKB_DROP_REASON_NETFILTER_DROP, EPERM);
        }

Could you check if that's the case with 'nft list ruleset' once you get that error?

I'll check the rest of what you shared later.

Comment 9 Laurent Jacquot 2024-02-02 17:42:40 UTC

Ok, I think we can discard previous messages: I stopped nftables and I do not see sndmmsg eperm anymore.
Thank for pointing out a problem in my ruleset :-) I'll debug that later on.

but the problem is still there: pasta sends replies that the client does not see.

tcpdump thinks the reply comes from ::13 but this does not seem reliable.

tcpdump -nn -i br0 host 2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78
18:26:05.532847 IP6 2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78.57673 > 2a01:e0a:1f1:9700::14.5300: UDP, length 48
18:26:05.534349 IP6 2a01:e0a:1f1:9700::13.5300 > 2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78.57673: UDP, length 52

and here is the strace, where recvmmsg and sendmmsg only show the client:

epoll_wait(4, [], 8, 1000)              = 0
epoll_wait(4, [{events=EPOLLIN, data={u32=2820, u64=504684860875475716}}], 8, 1000) = 1
recvmmsg(11, [{msg_hdr={msg_name={sa_family=AF_INET6, sin6_port=htons(57673), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78", &sin6_addr), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\30\32\1 \0\1\0\0\0\0\0\1\4free\2fr\0\0\1\0\1\0\0)\20\0\0\0"..., iov_len=65487}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=48}], 1, 0, NULL) = 1
write(15, "v\0047\2450\276\212j\262\305\279\206\335`\0\0\0\08\21\377*\1\16\n\1\361\227\0$["..., 110) = 110
epoll_wait(4, [{events=EPOLLIN, data={u32=3848, u64=3848}}], 8, 1000) = 1
read(15, "\212j\262\305\279v\0047\2450\276\206\335`\17hP\0<\21@*\1\16\n\1\361\227\0\0\0"..., 8388608) = 114
read(15, 0x563ad342f072, 8388494)       = -1 EAGAIN (Ressource temporairement non disponible)
sendmmsg(17, [{msg_hdr={msg_name={sa_family=AF_INET6, sin6_port=htons(57673), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78", &sin6_addr), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\30\32\201\200\0\1\0\1\0\0\0\1\4free\2fr\0\0\1\0\1\300\f\0\1\0\1\0"..., iov_len=52}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=52}], 1, MSG_NOSIGNAL) = 1
epoll_wait(4, [], 8, 1000)              = 0

I dunno how to debug further.
I can recompile pasta with custom patches if needed..

Comment 10 Laurent Jacquot 2024-02-02 18:45:40 UTC

Hi I'm able to reproduce the bug with a simplified setup

no nftables
no selinux (setenforce 0)

pod@jack:~$pasta --config-net -o 2a01:e0a:1f1:9700::14  -u 5300
pod@jack:~#dnsmasq -C /dev/null -h -k -d -q -R --server=2a01:e0c:1:1599::22 -p 5300

from the container everything works

on a remote client dig -p 5300 A free.fr @2a01:e0a:1f1:9700::14

dnsmasq sees the query
dnsmasq: query[A] free.fr from 2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78

but no reply

Comment 11 Stefano Brivio 2024-02-02 21:14:22 UTC

(In reply to Laurent Jacquot from comment #10)
> Hi I'm able to reproduce the bug with a simplified setup
> 
> no nftables
> no selinux (setenforce 0)
> 
> pod@jack:~$pasta --config-net -o 2a01:e0a:1f1:9700::14  -u 5300
> pod@jack:~#dnsmasq -C /dev/null -h -k -d -q -R --server=2a01:e0c:1:1599::22
> -p 5300
> 
> from the container everything works
> 
> on a remote client dig -p 5300 A free.fr @2a01:e0a:1f1:9700::14
> 
> dnsmasq sees the query
> dnsmasq: query[A] free.fr from 2a01:e0a:1f1:9700:245b:a0ca:1a01:6e78
> 
> but no reply
Ah, great, thanks, assuming it's the same problem, I'll try to reproduce it this way.

One question though: does dnsmasq reply to the query, or it's invalid to start with? You could check that with a packet capture (passing --pcap / -p <file.pcap> to pasta).

Comment 12 Laurent Jacquot 2024-02-03 12:45:57 UTC

Created attachment 32 [details]
capture of flows in the container

when using -o 2a01:e0a:1f1:9700::14 only, the namespace gets attributed
the ::13 address which is already used in another rootless container.
To avoid confusion I added a '-a -n' option:

pasta --debug --trace -l /home/pod/log --pcap /home/pod/pcap1 --config-
net -a 2a01:e0a:1f1:9700::f14 -n 2a01:e0a:1f1:9700::/64 -o
2a01:e0a:1f1:9700::14  -u 5300

Attached is the pcap captured which seems to be correct: the dns
request arrives, a querying to the upstream is done, answer received
and response replied.. but the remote client never sees it

I added also just in case the log file

Comment 13 Laurent Jacquot 2024-02-03 12:46:47 UTC

Created attachment 33 [details]
log from pasta --trace

Comment 14 David Gibson 2024-02-06 05:43:12 UTC

Laurent, thanks for the simplified scenario.  Unfortunately, I nonetheless haven't managed to reproduce the problem myself.  So, next batch of debugging queries:

1) If you start the pasta/server, does the dig fail immediately, or does it only fail if you run the dig some time later?

2) Does the problem occur if you omit the -o option, and just direct the dig at any of the host addresses?

3) The -p option you used captures a packet trace as viewed from the guest <-> pasta link.  Can you also run tcpdump on the host while running this test, so that we can see the packets as seen by the host / external system.


As you say, the guest side pcap looks correct, so my best guess is that pasta is not correctly translating the reply packets back to the external addresses.  I'm not sure why that would be the case, though.

Btw, I'm pretty sure the -n option you're adding will have no effect: -n is strictly for setting an IPv4 netmask and should have no effect on IPv6.  Plus it only affects what's advertised via DHCP, which you're not using here (--config-net directly configures the namespace, without using DHCP).

Comment 15 Laurent Jacquot 2024-02-07 20:19:27 UTC

Hello
> 1) If you start the pasta/server, does the dig fail immediately, or does it only fail if you run the dig some time later?
It fails immédiately

> 2) Does the problem occur if you omit the -o option, and just direct the dig at any of the host addresses?
dig fails for the ::14 and the ::4 adress, but works with ::13

looking at the code, I see that conf.c takes care of every config options using udp_sock_init(c, 0, af, addr, ifname, i); => this correctly configure the -u option, espacially the ifname and addr parameter

Just after, udp_init() is called which use the udp_sock_init_init() function: this one overload all the conf alreday setup without any options:
dp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, dst);

the result is pasta listening on everything:
ss -aun |grep 5300
UNCONN 0      0                                0.0.0.0:5300        0.0.0.0:*          
UNCONN 0      0                                   [::]:5300           [::]:*          
UNCONN 0      0                [2a01:e0a:1f1:9700::14]:5300           [::]:*      

If I comment out the call to udp_sock_init_init() in udp_init() the result is 
ss -aun |grep 5300   
UNCONN 0      0                [2a01:e0a:1f1:9700::14]:5300           [::]:*     

and indeed everything works.


cmdline used that works with udp_sock_init_init() call commented:
pasta --config-net -a 2a01:e0a:1f1:9700::f14  -u 2a01:e0a:1f1:9700::14/5300

here is the following patch:
--- udp.c.orig	2024-02-07 20:29:06.132032043 +0100
+++ udp.c	2024-02-07 19:27:55.035453812 +0100
@@ -1128,7 +1128,7 @@
 
 	if (c->mode == MODE_PASTA) {
 		udp_splice_iov_init();
-		udp_sock_init_init(c);
+	//	udp_sock_init_init(c);
 		NS_CALL(udp_sock_init_ns, c);
 	}

but udp_sock_init_init() is then unused => if confirmed, I can send a proper patch.

Comment 16 David Gibson 2024-02-08 06:03:38 UTC

Good catch, thank for your analysis Laurent.

Something is certainly wrong here, though I'm not quite sure it's as simple as removing the udp_sock_init_init() call. Some observations:

* The udp_tap_map[] array tracks sockets both for fixed forwardings (-u NNNN) as well as automatically created forwardings (-u auto). This differs from the somewhat similar tcp_sock_init_ext[] which tracks *only* automatically created listening sockets. Explicitly created listening sockets and "free floating" - they're not in any data structure, and we handle events on them based purely on the information in their epoll reference.

I suspect this subtle difference may have been one of the causes of incorrectly setting things up for UDP here.

* This isn't exactly the issue here, but a somewhat related one. If one were to set up forwarding with an explicit address (e.g. -u 1.2.3.4/5300) and also use the -o option with a different address (-o 4.3.2.1) we'd need to determinme which one took priority in setting the source address for outbound reply packets to the forwarded address. I think the explicit -u address should win - both for utility and to match TCP behaviour. However, our current data structures aren't actually capable of handling that correctly. I'm working on this gradually.

* The ss output in comment 15 shows the system listening on port 5300 on both the [::] and [2a01:e0a:1f1:9700::14] address. This should conflict - I would have expected whichever socket was bound second to get EADDRINUSE. I recall a suspiciously similar kernel bug that allowed this some time ago, so I'm wondering if you have an affected kernel version. Unfortunately, I've been unable to track down details of the bug/version in question.

* Those calls to udp_sock_init_init() and udp_sock_init_ns() were added when we didn't have periodic polling for UDP listening ports. We've since added that (commit 457ff122e33c). So, those calls are probably not necessary - TCP only does the equivalent set up of auto listening ports on the first timer call, so I guess that should be ok for UDP as well. If we did need an initial setup pass we should do it using the same udp_port_rebind() function that we use for the timer, rather than duplicating the logic slightly different and wrong as we do now.

* That said, the existing auto binding should probably respect the -o option, and currently it doesn't. To be fair, exactly what the semantics should be are kinda fuzzy due to the non-obvious interactiongs between bound/listening address and "output" address. I'm working towards structural changes that will make that saner, but it's a ways off.

Comment 17 Stefano Brivio 2024-02-08 12:36:40 UTC

(In reply to David Gibson from comment #16)
> Good catch, thank for your analysis Laurent.
> 
> Something is certainly wrong here, though I'm not quite sure it's as simple
> as removing the udp_sock_init_init() call.  Some observations:
> 
> * The udp_tap_map[] array tracks sockets both for fixed forwardings (-u
> NNNN) as well as automatically created forwardings (-u auto).  This differs
> from the somewhat similar tcp_sock_init_ext[] which tracks *only*
> automatically created listening sockets.  Explicitly created listening
> sockets and "free floating" - they're not in any data structure, and we
> handle events on them based purely on the information in their epoll
> reference.
> 
>   I suspect this subtle difference may have been one of the causes of
> incorrectly setting things up for UDP here.
I don't recall the details of my reasoning exactly, but this was pretty much intended (which doesn't necessarily mean it's correct, of course): for TCP, after we accept a connection on a listening socket, the socket from accept() will retain the initial addressing. For UDP, we need to track which addresses (loopback, link-local, or GUA) were used for that specific port, because the equivalent socket doesn't "remember" it for us.

But actually I'm not sure how this is relevant here.

> * This isn't exactly the issue here, but a somewhat related one.  If one
> were to set up forwarding with an explicit address (e.g. -u 1.2.3.4/5300)
> and also use the -o option with a different address (-o 4.3.2.1) we'd need
> to determinme which one took priority in setting the source address for
> outbound reply packets to the forwarded address.  I think the explicit -u
> address should win - both for utility and to match TCP behaviour.
I think so too -- after all, we currently have --outbound documented as:

  Use an IPv4 addr as source address for IPv4 outbound TCP connections, UDP flows, ICMP requests, or an IPv6 addr for IPv6 ones, by binding outbound sockets to it.

and I would say that inbound port forwarding doesn't lead to "outbound" flows.

> However,
> our current data structures aren't actually capable of handling that
> correctly.  I'm working on this gradually.
>
> * The ss output in comment 15 shows the system listening on port 5300 on
> both the [::] and [2a01:e0a:1f1:9700::14] address.  This should conflict - I
> would have expected whichever socket was bound second to get EADDRINUSE.  I
> recall a suspiciously similar kernel bug that allowed this some time ago, so
> I'm wondering if you have an affected kernel version.  Unfortunately, I've
> been unable to track down details of the bug/version in question.
Report at https://lore.kernel.org/stable/CAFsF8vL4CGFzWMb38_XviiEgxoKX0GYup=JiUFXUOmagdk9CRg@mail.gmail.com/, issue fixed by commit 7a7160edf1bf ("net: Return errno in sk->sk_prot->get_port().").

> * Those calls to udp_sock_init_init() and udp_sock_init_ns() were added when
> we didn't have periodic polling for UDP listening ports.  We've since added
> that (commit 457ff122e33c).  So, those calls are probably not necessary -
> TCP only does the equivalent set up of auto listening ports on the first
> timer call, so I guess that should be ok for UDP as well.  If we did need an
> initial setup pass we should do it using the same udp_port_rebind() function
> that we use for the timer, rather than duplicating the logic slightly
> different and wrong as we do now.
At a quick glance, I would also say they're not needed, so I would appreciate a patch dropping calls and functions... but that patch should also make sure we run the timer right away (or equivalent). That is, something like:

  pasta --config-net -t auto -u auto -- iperf3 -u

(without a delay before iperf3 starts) should work too.

> * That said, the existing auto binding should probably respect the -o
> option, and currently it doesn't.  To be fair, exactly what the semantics
> should be are kinda fuzzy due to the non-obvious interactiongs between
> bound/listening address and "output" address.  I'm working towards
> structural changes that will make that saner, but it's a ways off.
I guess that can wait -- either until your flow table changes are complete, or until somebody reports a use case for it. I'm not aware of any at the moment.

Comment 18 David Gibson 2024-02-09 05:32:12 UTC

(In reply to Stefano Brivio from comment #17)
> (In reply to David Gibson from comment #16)
> > Good catch, thank for your analysis Laurent.
> > 
> > Something is certainly wrong here, though I'm not quite sure it's as simple
> > as removing the udp_sock_init_init() call.  Some observations:
> > 
> > * The udp_tap_map[] array tracks sockets both for fixed forwardings (-u
> > NNNN) as well as automatically created forwardings (-u auto).  This differs
> > from the somewhat similar tcp_sock_init_ext[] which tracks *only*
> > automatically created listening sockets.  Explicitly created listening
> > sockets and "free floating" - they're not in any data structure, and we
> > handle events on them based purely on the information in their epoll
> > reference.
> > 
> >   I suspect this subtle difference may have been one of the causes of
> > incorrectly setting things up for UDP here.
> I don't recall the details of my reasoning exactly, but this was pretty much
> intended (which doesn't necessarily mean it's correct, of course): for TCP,
> after we accept a connection on a listening socket, the socket from accept()
> will retain the initial addressing. For UDP, we need to track which
> addresses (loopback, link-local, or GUA) were used for that specific port,
> because the equivalent socket doesn't "remember" it for us.
> 
> But actually I'm not sure how this is relevant here.

I realise why the difference is there: as you say, "free floating" fds just wouldn't work for udp.  It's relevant only insofar as the equivalent code in TCP would be safe, because it wouldn't clobber setup for explicit port forwardings.

> > * This isn't exactly the issue here, but a somewhat related one.  If one
> > were to set up forwarding with an explicit address (e.g. -u 1.2.3.4/5300)
> > and also use the -o option with a different address (-o 4.3.2.1) we'd need
> > to determinme which one took priority in setting the source address for
> > outbound reply packets to the forwarded address.  I think the explicit -u
> > address should win - both for utility and to match TCP behaviour.
> I think so too -- after all, we currently have --outbound documented as:
> 
>   Use an IPv4 addr as source address for IPv4 outbound TCP connections, UDP
> flows, ICMP requests, or an IPv6 addr for IPv6 ones, by binding outbound
> sockets to it.
> 
> and I would say that inbound port forwarding doesn't lead to "outbound"
> flows.

Makes sense.

> > However,
> > our current data structures aren't actually capable of handling that
> > correctly.  I'm working on this gradually.
> >
> > * The ss output in comment 15 shows the system listening on port 5300 on
> > both the [::] and [2a01:e0a:1f1:9700::14] address.  This should conflict - I
> > would have expected whichever socket was bound second to get EADDRINUSE.  I
> > recall a suspiciously similar kernel bug that allowed this some time ago, so
> > I'm wondering if you have an affected kernel version.  Unfortunately, I've
> > been unable to track down details of the bug/version in question.
> Report at
> https://lore.kernel.org/stable/
> CAFsF8vL4CGFzWMb38_XviiEgxoKX0GYup=JiUFXUOmagdk9CRg@mail.gmail.com/, issue
> fixed by commit 7a7160edf1bf ("net: Return errno in
> sk->sk_prot->get_port().").

Thanks.  Laurent, does it look like you might have an affected kernel?

> > * Those calls to udp_sock_init_init() and udp_sock_init_ns() were added when
> > we didn't have periodic polling for UDP listening ports.  We've since added
> > that (commit 457ff122e33c).  So, those calls are probably not necessary -
> > TCP only does the equivalent set up of auto listening ports on the first
> > timer call, so I guess that should be ok for UDP as well.  If we did need an
> > initial setup pass we should do it using the same udp_port_rebind() function
> > that we use for the timer, rather than duplicating the logic slightly
> > different and wrong as we do now.
> At a quick glance, I would also say they're not needed, so I would
> appreciate a patch dropping calls and functions... but that patch should
> also make sure we run the timer right away (or equivalent). That is,
> something like:
> 
>   pasta --config-net -t auto -u auto -- iperf3 -u
> 
> (without a delay before iperf3 starts) should work too.

So.. I'm inclined to leave the timer as is: it will then behave the same as TCP, which doesn't do this setup until the timer first runs.  I'm not sure if we have any logic to run the timer early.

Note that running the timer / port setup *before* we start the command isn't useful: this is for automated port forwarding, so it won't see anything until something in the guest is actually listening.

The problem in this bug occurs not because of a failure in the automatic forwarding, but because a misguided attempt to do early setup for the automatic forwarding is clobbering a perfectly good explicit forwarding setup.

> > * That said, the existing auto binding should probably respect the -o
> > option, and currently it doesn't.  To be fair, exactly what the semantics
> > should be are kinda fuzzy due to the non-obvious interactiongs between
> > bound/listening address and "output" address.  I'm working towards
> > structural changes that will make that saner, but it's a ways off.
> I guess that can wait -- either until your flow table changes are complete,
> or until somebody reports a use case for it. I'm not aware of any at the
> moment.

Right.

Comment 19 David Gibson 2024-02-09 06:03:00 UTC

Laurent,

I've made a draft of a more polished version of your fix at https://gitlab.com/dgibson/passt/-/tree/bug79?ref_type=heads

Please give it a test for your setup.

Btw, for your particular use case, I think explicitly giving a listening address in the -u option probably better expresses what you want than the -o option.  It doesn't really make a difference for the fix, though: without it neither will work, and with it both should.

Stefano,

I'll post the suggested changes once I've completed a testing cycle.

Comment 20 David Gibson 2024-02-09 06:12:44 UTC

Something is wrong with my draft, it's causing a regression in the test suite.  Debugging...

Comment 21 Laurent Jacquot 2024-02-09 12:18:04 UTC

(In reply to David Gibson from comment #20)
> Something is wrong with my draft, it's causing a regression in the test
> suite.  Debugging...

Hello
I did not propose to remove the NS_CALL(udp_sock_init_ns, c) because if I understand correctly, the conf_ports() call in the initial setup is done in the host namespace, but it should be done in the newly created namespace shouldn't it?

NS_CALL(udp_sock_init_ns, c) in udp_init() takes care of that, but does not carry -U and -o information..

Concerning my kernel, it's the latest from f39: uname -r
6.6.14-200.fc39.x86_64

commit 7a7160edf1bf seems to be there and if I test the bug as per https://bugzilla.redhat.com/show_bug.cgi?id=2159066#c1, the test correctly fails

nc -l 5000 &
nc-l 5000
Ncat: bind to :::5000: Address already in use. QUITTING.

Comment 22 David Gibson 2024-02-12 04:10:47 UTC

Laurent,

Yes, indeed removal of udp_sock_init_ns() was the problem.  I realise you didn't suggest that initially, but at first look it appeared to have the same problem - I hadn't realised that of course we couldn't open these sockets during conf_ports() because the namespace doesn't exist yet.

Although we can't remove it entirely, I think it's still not quite correct, and unnecessarily duplicates the logic of udp_port_rebind_outbound().  I'm working on a new patch based on this now.

Comment 23 David Gibson 2024-02-12 04:17:09 UTC

New version pushed to gitlab, I'll post once it's had a more complete test run.

Comment 24 Laurent Jacquot 2024-02-12 18:41:19 UTC

Created attachment 34 [details]
Fix  typo in udp_sock_init()

Hello,

using udp_port_rebind_outbound() seems the right thing to do but maybe you could consider in the mean time the attached patch that enables the -U option.
Namely, a typo in the code makes handling packets back to calling process in the namespace fail:
You have to replace udp_splice_ns[V6][src].sock with udp_splice_ns[V6][uref.port].sock in udp_sock_init():

diff -ru passt.orig/udp.c passt-f091893c1ffe1a531989a599737031089f6cfcb4/udp.c
--- passt.orig/udp.c	2023-12-30 11:45:27.000000000 +0100
+++ passt-f091893c1ffe1a531989a599737031089f6cfcb4/udp.c	2024-02-12 19:28:03.047904772 +0100
@@ -1014,7 +1014,7 @@
 
 			r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, &loopback,
 					 ifname, port, uref.u32);
-			udp_splice_ns[V4][port].sock = s < 0 ? -1 : s;
+			udp_splice_ns[V4][uref.port].sock = s < 0 ? -1 : s;
 		}
 	}
 
@@ -1031,7 +1031,7 @@
 			r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP,
 					 &in6addr_loopback,
 					 ifname, port, uref.u32);
-			udp_splice_ns[V6][port].sock = s < 0 ? -1 : s;
+			udp_splice_ns[V6][uref.port].sock = s < 0 ? -1 : s;
 		}
 	}


the rest of the patch is cosmetic and documentation.
regards
Laurent

Comment 25 David Gibson 2024-02-13 00:51:09 UTC

(In reply to Laurent Jacquot from comment #24)
> Created attachment 34 [details]
> Fix  typo in udp_sock_init()
> 
> Hello,
> 
> using udp_port_rebind_outbound() seems the right thing to do but maybe you
> could consider in the mean time the attached patch that enables the -U
> option.

To clarify, are you saying that the patch to use udp_port_rebind_outbound() breaks -U?  Or that it was already broken?

I'm guessing from the patch that it's only broken in the case where you're forwarding the internal port to a different number.  Is that correct?


The attached patch seems to be doing 3 quite different things, and I'm not really sure of the rationale for each of them:

 1) It's moving the parsing of outbound forwarding options (-T, -U) a bit earlier.  I'm not sure what the purpose of this is.

 2) It's adding some clarifying changes to the man page and some comments.  Nice to have, but obviously not a fix of itself.

 3) It's making this change..

> Namely, a typo in the code makes handling packets back to calling process in
> the namespace fail:
> You have to replace udp_splice_ns[V6][src].sock with
> udp_splice_ns[V6][uref.port].sock in udp_sock_init():

This one doesn't seem correct to me, at least at first glance.  Since udp_splice_ns[] is describing things within the namespace, it should be indexed in terms of the port number as seen in the namespace ('port'), not the port number it will be translated to on the host ('uref.port').  Am I missing something?

> 
> diff -ru passt.orig/udp.c
> passt-f091893c1ffe1a531989a599737031089f6cfcb4/udp.c
> --- passt.orig/udp.c	2023-12-30 11:45:27.000000000 +0100
> +++ passt-f091893c1ffe1a531989a599737031089f6cfcb4/udp.c	2024-02-12
> 19:28:03.047904772 +0100
> @@ -1014,7 +1014,7 @@
>  
>  			r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, &loopback,
>  					 ifname, port, uref.u32);
> -			udp_splice_ns[V4][port].sock = s < 0 ? -1 : s;
> +			udp_splice_ns[V4][uref.port].sock = s < 0 ? -1 : s;
>  		}
>  	}
>  
> @@ -1031,7 +1031,7 @@
>  			r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP,
>  					 &in6addr_loopback,
>  					 ifname, port, uref.u32);
> -			udp_splice_ns[V6][port].sock = s < 0 ? -1 : s;
> +			udp_splice_ns[V6][uref.port].sock = s < 0 ? -1 : s;
>  		}
>  	}
> 
> 
> the rest of the patch is cosmetic and documentation.
> regards
> Laurent

Comment 26 Laurent Jacquot 2024-02-13 12:46:53 UTC

(In reply to David Gibson from comment #25)

> To clarify, are you saying that the patch to use udp_port_rebind_outbound()
> breaks -U?  Or that it was already broken?
It is already broken, and as the rework of udp_port_rebind_outbound() might take time, I propose this quick change to enable port forwarding if asked via -U option.

> 
> I'm guessing from the patch that it's only broken in the case where you're
> forwarding the internal port to a different number.  Is that correct?
> 
> 
> The attached patch seems to be doing 3 quite different things, and I'm not
> really sure of the rationale for each of them:
> 
>  1) It's moving the parsing of outbound forwarding options (-T, -U) a bit
> earlier.  I'm not sure what the purpose of this is.

As the 'T' and 'U' option is only parsed (no binding occurs, see conf_ports()), I saw no point in doing another full argv loop after pasta_start_ns(), so I inserted them a the same time as 't' and 'u' and documented it.. This is cosmetic, feel free to discard it

> 
>  2) It's adding some clarifying changes to the man page and some comments. 
> Nice to have, but obviously not a fix of itself.
you are right, but tried somthing like -U @ipv6%ifname/5353:53 with no effect just to understand later that the core structures cannot support that. I thought it was a good idea to clarify.


> 
>  3) It's making this change..
> 
> > Namely, a typo in the code makes handling packets back to calling process in
> > the namespace fail:
> > You have to replace udp_splice_ns[V6][src].sock with
> > udp_splice_ns[V6][uref.port].sock in udp_sock_init():
> 
> This one doesn't seem correct to me, at least at first glance.  Since
> udp_splice_ns[] is describing things within the namespace, it should be
> indexed in terms of the port number as seen in the namespace ('port'), not
> the port number it will be translated to on the host ('uref.port').  Am I
> missing something?
> 
Yes it seemed also odd to me, but IIUC the reply packet (comming from host to ns) is processed by udp_splice_sendfrom() in the following sniplet:
if (from_pif == PIF_SPLICE) {
        this is for packets from ns to host
} else {
                ASSERT(from_pif == PIF_HOST);
                src += c->udp.fwd_out.rdelta[src]; /*<-- src is 53 when the host dns server reply to the calling process
                s = udp_splice_ns[v6][src].sock;
...
}

Comment 27 Laurent Jacquot 2024-02-13 17:40:10 UTC

(In reply to Laurent Jacquot from comment #26)
> (In reply to David Gibson from comment #25)
> 
> > To clarify, are you saying that the patch to use udp_port_rebind_outbound()
> > breaks -U?  Or that it was already broken?
> It is already broken, and as the rework of udp_port_rebind_outbound() might
> take time, I propose this quick change to enable port forwarding if asked
> via -U option.
> 
> > 
> > I'm guessing from the patch that it's only broken in the case where you're
> > forwarding the internal port to a different number.  Is that correct?
> > 
I realise I am crossing issues :-(
-u problem: the patch you sent to the dev-list works for me, problem closed as far as I'm concerned.
-U problem: still pending wih the src vs uref.port discussion. I confirm without the fix it's broken when forwarding the internal port to a different number.
eg. pasta -U 4343:53

udp_port_rebind_outbound() has nothing to do with the later. Sorry for the confusion

Comment 28 David Gibson 2024-02-14 01:41:51 UTC

(In reply to Laurent Jacquot from comment #27)
> (In reply to Laurent Jacquot from comment #26)
> > (In reply to David Gibson from comment #25)
> > 
> > > To clarify, are you saying that the patch to use udp_port_rebind_outbound()
> > > breaks -U?  Or that it was already broken?
> > It is already broken, and as the rework of udp_port_rebind_outbound() might
> > take time, I propose this quick change to enable port forwarding if asked
> > via -U option.
> > 
> > > 
> > > I'm guessing from the patch that it's only broken in the case where you're
> > > forwarding the internal port to a different number.  Is that correct?
> > > 
> I realise I am crossing issues :-(
> -u problem: the patch you sent to the dev-list works for me, problem closed
> as far as I'm concerned.

Ok, good to know.  I hope this fixed will be merged soon.

> -U problem: still pending wih the src vs uref.port discussion. I confirm
> without the fix it's broken when forwarding the internal port to a different
> number.
> eg. pasta -U 4343:53

Ok.  For the purposes of tracking, could you file this as a new BZ please?

Fwiw, I also haven't been able to reproduce this - -U seemed to work as expected for me, both when changing the port number and when not.  If you could give a complete example of the steps you're using when attempting this in the new BZ, that would be most helpful.

> udp_port_rebind_outbound() has nothing to do with the later. Sorry for the
> confusion

Ok, understood.

Comment 29 Laurent Jacquot 2024-02-14 13:17:29 UTC

Closing this BZ as fixed.
Thank you for your help (and this great software)

Note You need to log in before you can comment on or make changes to this bug.