While looking at how we might fix bug 60, I discovered a number of other problems with how we use netlink:

1. We use 0 rather than 1 as our first sequence number. This works, but isn't ideal, since 0 is usually used for asynchronous notifications from the kernel.

2. When duplicating routes, we send a number of request messages in a single packet. This may result in multiple response packets, but we only process the first one. This means that responses can get out of sync for subsequent operations. This is mitigated by the "flush" logic at the start of nl_req(), but that will only get rid of one stale response, and there could be several.

3. When duplicating routes, we send the same batch of requests multiple times, since earlier attempts might fail due to route dependencies. However, when we resubmit the requests we also reuse the sequence numbers. This appears to work, but isn't how you're generally supposed to use netlink.

4. In general we only process one reply datagram per request, but it appears that the response can sometimes be split across multiple datagrams: in particular, dump requests seem to have the actual responses and the NLMSG_DONE marker in separate datagrams. The "flush" logic in nl_req() again appears to handle this, but in a rather confusing way (we deal with extra packets on the next request, rather than as part of the request that prompted them).
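For illustration, here is a rough sketch of how points 1 and 4 might be handled. It is not the existing nl_req() code and not necessarily what the fix will look like: nl_seq_next() and nl_scan() are hypothetical names made up for this example, and only the macros and structures from <linux/netlink.h> are real. It assumes each request carries its own sequence number, and that a reply ends with NLMSG_DONE, NLMSG_ERROR (which doubles as the ACK when its error field is 0), or a single message without NLM_F_MULTI.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <linux/netlink.h>

/* Hypothetical return values for nl_scan(), not part of any real API */
enum nl_status { NL_MORE = 0, NL_DONE = 1 };

/* Hypothetical helper: hand out sequence numbers starting from 1, and skip 0
 * on wrap-around, since 0 is conventionally used for kernel notifications.
 */
static uint32_t nl_seq_next(uint32_t *seq)
{
	if (++(*seq) == 0)
		*seq = 1;
	return *seq;
}

/* Hypothetical helper: scan one received datagram for replies to 'seq'.
 *
 * Returns NL_DONE once the reply is complete, NL_MORE if the caller should
 * read another datagram, or a negative errno value reported by the kernel.
 * Messages with a different sequence number (stale replies, notifications)
 * are skipped.
 */
static int nl_scan(void *buf, ssize_t len, uint32_t seq)
{
	struct nlmsghdr *nh;

	assert(buf);

	for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_seq != seq)
			continue;

		if (nh->nlmsg_type == NLMSG_DONE)
			return NL_DONE;	/* end of a multipart (dump) reply */

		if (nh->nlmsg_type == NLMSG_ERROR) {
			struct nlmsgerr *err = NLMSG_DATA(nh);

			if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
				return -EBADMSG;	/* truncated error message */

			/* error == 0 is an ACK, otherwise a negative errno */
			return err->error ? err->error : NL_DONE;
		}

		/* Data message: a real caller would process the payload here */
		if (!(nh->nlmsg_flags & NLM_F_MULTI))
			return NL_DONE;	/* single-part reply, nothing follows */
	}

	return NL_MORE;	/* no terminator in this datagram yet */
}
```

A caller would keep calling recv() and passing each datagram to nl_scan() until it returns something other than NL_MORE, so that extra datagrams are consumed as part of the request that prompted them rather than flushed on the next one.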
I'm working on a series to address these issues.
Fixes now merged.