Okay, this is now fixed in version S! This was a really screwy issue, and I'm pretty sure it's a bug in the Lidgren Network library, or else I'm misunderstanding how some of it is supposed to work. Let me try to explain what was going on, since I know a number of you are curious.
First, a bit of background:
-------------------------
1. The game sends all data via UDP, through the Lidgren Network library, which makes sure that things are Reliable (all messages get there) and Ordered (they get there in the right order). Lidgren is also responsible for things like resending when packets/messages don't arrive, and reconstituting messages on the other side when they are split across multiple packets (as will happen, since messages can be arbitrarily large).
2. On top of this, AI War has its own messaging system, with what I call "Player Messages." Normally, these messages are reasonably small (under 3-4KB at most), and during the game these are sent to the server, and from the server, every 200ms. During the game, it has 200ms in which to process the network sending without causing lag, which works beautifully in almost all circumstances at this stage.
3. When AI War does a full sync (such as loading a savegame), it used to create one giant PlayerMessage that had the entire save file in it (so this can range from 200KB to 1.5MB or more). More recently, I've switched it so that it breaks this one big message up into 100 sub-messages, so that this can feed the progress percentage for the sync (the client checks how many it has received, out of how many the host has told it to expect, and both shows this to the client player and reports it to the host so it can show it to the host player). All 100 of these messages were being dumped into the internal Lidgren queue immediately, and as you can guess many of these messages could individually be from 20-150KB depending on the size of the save file.
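To make item 3 a bit more concrete, here is a minimal Python sketch of splitting a save into 100 sub-messages and tracking progress on the client. This is not the actual AI War code; `split_into_parts` and `SyncReceiver` are hypothetical names made up for illustration, and it assumes the parts arrive in order (which Reliable + Ordered delivery is supposed to guarantee).

```python
def split_into_parts(payload: bytes, part_count: int = 100) -> list[bytes]:
    """Split payload into part_count slices of nearly equal size."""
    size = -(-len(payload) // part_count)  # ceiling division
    return [payload[i * size:(i + 1) * size] for i in range(part_count)]


class SyncReceiver:
    """Client side: collects parts and reports how far along the sync is."""

    def __init__(self, expected_parts: int):
        # The host tells the client how many parts to expect.
        self.expected_parts = expected_parts
        self.parts: list[bytes] = []

    def receive(self, part: bytes) -> int:
        """Store one part and return the progress percentage (0-100)."""
        self.parts.append(part)
        return 100 * len(self.parts) // self.expected_parts

    def is_complete(self) -> bool:
        return len(self.parts) >= self.expected_parts

    def payload(self) -> bytes:
        """Reassemble the original data once every part has arrived."""
        return b"".join(self.parts)
```

The percentage returned by `receive` is what would be shown to the client player and reported back to the host.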
The Symptoms:
---------------
The problem, after much experimenting and testing with my dad (who was able to duplicate this with his connection), was indeed resent packets. Using some new network debugging info that I added (Shift+F3 in the new release), I was able to see this definitively. His machine was sending data constantly, but it was duplicative and really jumbled in general.
The Problem:
------------
The symptoms aren't necessarily an indication that there is packet loss, even though that is normally the cause of resends of this sort. Instead, from what I could tell in this case, it was a case of a traffic jam. Lidgren waits a certain amount of time before doing a resend on each message if it does not get a response, but in this case there were so many large queued messages that this was causing a massive number of resends, which then caused other various network-contention problems.
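Here is a toy Python model of that traffic jam. It is only a sketch under loud assumptions: every message is queued at the same moment, each resend timer starts at queue time, the link sends messages back to back, and no packets are ever lost. This is not how Lidgren actually schedules resends (I don't know its internals well enough to claim that), and `messages_timing_out` is a hypothetical helper.

```python
def messages_timing_out(message_sizes, bandwidth_bytes_per_s,
                        resend_timeout_s, round_trip_s=0.0):
    """Count messages whose first ack cannot arrive before the resend timer fires.

    Assumes all messages are queued at t=0 and sent back to back with no loss.
    """
    elapsed = 0.0
    count = 0
    for size in message_sizes:
        # Time at which this message has fully left the sender.
        elapsed += size / bandwidth_bytes_per_s
        # The ack can come back no sooner than one round trip later.
        if elapsed + round_trip_s > resend_timeout_s:
            count += 1
    return count


# Made-up numbers: 100 messages of 15,000 bytes each, a 50,000 byte/s
# upload link, and a 1 second resend timeout.
jammed = messages_timing_out([15000] * 100, 50000, 1.0)
```

With those made-up numbers, all but the first three messages hit their resend timer before they could possibly have been acknowledged, even on a link that drops nothing. Each of those resends then competes with the original traffic, which is the jam described above.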
In addition, this was causing so much flooding of the network card that it was sometimes literally killing my dad's Skype connection, which is just crazy. My understanding had been that the Lidgren library was going to send each one in turn, properly queuing each one, but evidently there was a lot more overlap than that, and the result was a huge mess.
The Solution:
-------------
Fortunately, AI War already has a message-caching queue of its own, for when the network is temporarily unavailable. This message-caching queue slowly doles out its messages to each connection to the server, one message per game cycle (so 20 messages per second per connection). This was ideal for my needs here, because basically I needed a way to handle my own queue since Lidgren was not doing a good job of it (or because I was misinterpreting what Lidgren was supposed to be doing, but either way it amounts to the same thing).
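A queue that doles out one message per connection per game cycle can be sketched in Python roughly like this. It is an illustration only, not the game's real queue: `PacedSendQueue`, `enqueue`, `tick`, and the `send` callback are all hypothetical names, and `send` stands in for whatever actually hands a message to the network library.

```python
from collections import deque


class PacedSendQueue:
    """Holds outgoing messages and releases at most one per connection per tick."""

    def __init__(self):
        self._queues = {}  # connection id -> deque of pending messages

    def enqueue(self, connection_id, message):
        self._queues.setdefault(connection_id, deque()).append(message)

    def tick(self, send):
        """Call once per game cycle. Returns how many messages were sent."""
        sent = 0
        for connection_id, queue in self._queues.items():
            if queue:
                # Oldest message first, so ordering is preserved per connection.
                send(connection_id, queue.popleft())
                sent += 1
        return sent

    def pending(self):
        """Total number of messages still waiting across all connections."""
        return sum(len(queue) for queue in self._queues.values())
```

The point of the design is that the network library only ever sees one new message per connection at a time, so nothing piles up inside its own queue.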
So now these partial messages for the sync are dumped into a queue to be doled out one per connection per 30ms. This basically means that if I want to avoid contention issues that would lead to lots of resends, I'd need to make each partial message small enough to be sent well within that 30ms interval. I settled on 4.68KB max per message after some experimentation, and so now the sync data is divided into however many parts it will take to make sub-messages of that size, with a minimum of 200 parts.
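The sizing rule above (4.68KB max per part, minimum of 200 parts) works out roughly like this in Python. Again, this is a sketch and not the shipped code: the names are hypothetical, and it assumes 1KB = 1024 bytes, so 4.68KB is rounded down to 4792 bytes.

```python
import math

MAX_PART_BYTES = 4792  # 4.68KB, assuming 1KB = 1024 bytes, rounded down
MIN_PARTS = 200


def part_count(total_bytes: int) -> int:
    """Number of parts needed so that no part exceeds MAX_PART_BYTES."""
    return max(MIN_PARTS, math.ceil(total_bytes / MAX_PART_BYTES))


def split_sync_data(payload: bytes) -> list[bytes]:
    """Split the sync data into part_count(len(payload)) nearly equal parts."""
    count = part_count(len(payload))
    size = math.ceil(len(payload) / count)
    return [payload[i * size:(i + 1) * size] for i in range(count)]
```

Under these assumptions, a 200KB save still gets the minimum 200 parts (about 1KB each), while a 1.5MB save needs more than 200 parts to keep each one under the size cap.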
The Caveats:
-------------
This solution works extremely well, it would seem from testing it so far, though it is not without some drawbacks. Basically, if you have 8 players connecting all at once, and your server's network connection is not up to par, you might see a recurrence of these contention issues. Or if your network connection is not able to keep up in general, basically, you might see some resends that are wasteful. However, the mitigating factor here is that the messages are individually vastly smaller than they were in the past, so even in the worst case it should not be taking 20-30 minutes to connect even 8 players. But some of this is going to need more testing under various network conditions. I may need to make some sort of settings option to allow for some tuning of this in edge cases. But, hopefully not -- and it should still be vastly better even for the worst edge cases, so that's an improvement either way.
The other drawback is more minor, but still worth noting. Essentially, this solution works as a sort of speed limit for the transfer, in addition to causing it to not overlap. Given the asynchronous, unreliable nature of UDP, this is pretty much necessary for now, though I may revisit it someday when more data comes in. So this basically means that if you are connecting over a LAN and used to superfast loads, it's going to be way slower now -- more like an Internet load. For a 1.5MB file, assuming you have a connection that is able to send at full speed or better, it will take 30-40 seconds on average, which is slower than before (10ish seconds), but still quite fast in general.
The nice thing is that even Internet connections like my dad's now send to my machine in 30-40 seconds instead of taking more like 3-5 minutes. So this somewhat normalizes the load times (depending on filesize), assuming reasonable connections. Even for people on the LAN, this is an improvement as it won't stress out their network card in the same way (thus avoiding potential interference with other network devices or programs). So overall this should be a win for everyone, though I'm not thrilled about having a hard speed limit for transfers in the game (doesn't seem as future-proof as I'd like).
Next Steps:
-----------
The future-proofing aspects can be explored more, during the expansion work after 2.0, though. As long as this works reasonably well for everyone, and doesn't cause any major problems for anyone, this is going to be the final sync code for 2.0; there's too much risk of breaking something else the day before release if I try to go crazy with this.
The next step is for people who are affected by this issue to give it a try when possible and see if it works. Hopefully everyone will have the same success as with my dad's connection. Please let me know what you find, and please post a screenshot of the Shift+F3 data from the host if you're reporting a problem. Thanks!