Sweet. I recorded a video of the network info from the server's perspective. The "Waiting for players..." problem starts at 1:37, when the withheld messages count starts climbing. Let me know what you think! You may want to view it in HD mode to read the numbers more easily.
http://www.youtube.com/watch?v=oIKpJrm9Q0g
Thanks!
-Jeff
That video was incredibly helpful in at least understanding what is happening, if not yet why -- thanks!
Okay, so here is what is happening -- in the network layer, a packet is getting lost and never reaching its destination, so every packet after it is withheld. To give an example of what I mean, suppose we have a message A that gets split into 5 parts for sending (this is at the packet level, so below the level of the game itself). So you have these parts:
A1
A2
A3
A4
A5
Because of the unreliable nature of UDP, those might arrive in any order, and then they have to be reconstructed as the one big message "A," in the proper order. Additionally, if some of them are dropped (packet lost), those will be resent, and then message A will be reconstructed. So, again for example, in a normal case you might get this:
A2 (3ms)
A3 (3ms)
A5 (3ms)
A4 (3ms)
<resend dropped packet> (3ms)
A1 (3ms)
Or something along those lines. The network library (lidgren network) handles all of this, and basically waits until all those parts are collected, then hands them over to the game, where they are actually reconstructed. In the example above, you would have 4 withheld messages until A1 finally arrives, and then it would reconstruct and move on.
Since some packet loss is normal, this whole scenario is extremely common and generally resolves itself in less than a quarter of a second. What is happening in your case, for some reason, is that a huge number of packets are going into the withheld state, because they are still waiting on the resend and arrival of packets that should have come first (so, in other words, A1 never arrives, and everything else after that point just keeps queuing up).
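To make the withholding a bit more concrete, here is a rough Python sketch of the general idea. This is only a toy model of in-order delivery, not how the lidgren library actually implements it; the `ReorderBuffer` class and `simulate` function are names I made up for illustration. The first run follows the normal arrival order from the example above, and the second shows what the withheld count looks like when A1 never shows up.

```python
class ReorderBuffer:
    """Toy in-order delivery: anything that arrives ahead of the next
    expected sequence number is withheld until the gap is filled.
    (Hypothetical model, not the lidgren library's real code.)"""

    def __init__(self):
        self.next_expected = 1   # sequence number we are waiting on
        self.withheld = {}       # seq -> payload, received but not released
        self.delivered = []      # payloads handed over, in order

    def receive(self, seq, payload):
        # Ignore duplicates (for example, a resend that was not needed).
        if seq < self.next_expected or seq in self.withheld:
            return
        self.withheld[seq] = payload
        # Release everything that is now contiguous with what was delivered.
        while self.next_expected in self.withheld:
            self.delivered.append(self.withheld.pop(self.next_expected))
            self.next_expected += 1


def simulate(arrivals):
    """Feed sequence numbers in arrival order; record the withheld count
    after each arrival."""
    buf = ReorderBuffer()
    counts = []
    for seq in arrivals:
        buf.receive(seq, f"A{seq}")
        counts.append(len(buf.withheld))
    return buf, counts


# Normal case: A1 is dropped at first, then arrives via resend.
normal, normal_counts = simulate([2, 3, 5, 4, 1])
print(normal_counts)     # [1, 2, 3, 4, 0]
print(normal.delivered)  # ['A1', 'A2', 'A3', 'A4', 'A5']

# Stuck case: A1 never arrives, so nothing is ever released.
stuck, stuck_counts = simulate([2, 3, 5, 4, 6, 7, 8])
print(stuck_counts)      # [1, 2, 3, 4, 5, 6, 7]
print(stuck.delivered)   # []
```

In the normal run the withheld count climbs to 4 and drops back to 0 the moment A1 lands. In the stuck run it only ever goes up, which is the same pattern as the counter in your video.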
I've never seen this sort of thing with the lidgren library before, so I don't know exactly what the trouble is. I will send a note to the author of that library and see what he thinks the possible causes might be. To my mind, there are a few possibilities:
1. Some sort of bug in the lidgren library (probably unlikely, or others would be reporting it more, but it could be an "edge case" of some kind).
2. Something wrong with one of the network adapters, as I'd noted.
3. Something on one of your two routers that "doesn't like" that A1 packet, and is filtering it out as malicious or whatever else (application-level content filtering on a firewall).
4. Some sort of QoS service on one of your two routers that is incorrectly managing the stream of packets for some reason (very strange that Hamachi would act differently).
5. Some other sort of routing bug, at the ISP level or on one of the routers (which a firmware update on both routers might fix for you, if they are not already up to date).
My suggestion would be to at least look into possibilities 2 through 5, since this is something that -- so far at least -- seems isolated to your and your friend's case alone. That doesn't mean another case isn't going to come up in another 5 minutes or 5 days or 5 months from now, but with this sort of edge case, especially with the new data you've provided, my suspicion is that the problem lies even below the network library layer, down in the network driver or further out on the network itself.
Meanwhile, I will contact Lidgren and see if he has any thoughts on possibility 1 above, since this may prove a useful addition to his FAQ about the library, anyway. Being that it's the weekend, and that I have not been in communication with him in the past, I don't know what sort of response time to expect -- I'll keep you up to date when I hear anything, though. Thanks for your patience!