Author Topic: [Solved] "Waiting for players..." when move ~150 ships  (Read 11326 times)

Offline jeffhoye

  • Newbie
  • *
  • Posts: 3
[Solved] "Waiting for players..." when move ~150 ships
« on: October 30, 2009, 07:10:08 pm »
Hello.  Love the game, but we're having stability problems in multiplayer over the Internet.  When one of us moves any significant number of ships, that player drops out, and never comes back.  Our work-around has been to cancel/rejoin.  However, when you get about 1/2 way through a campaign, there just seems to be too much going on, and the game keeps losing the players every few seconds.

I looked around for exceptions files and such, but we're getting a lot more failures than represented in DecompressStringErrors.txt or ClientReceivingData.txt files (though, they had a few errors):

ERROR: System.FormatException: Invalid character in a Base-64 string.
   at System.Convert.FromBase64String(String s)
   at AIWar.Compressor.DecompressStringBig(String compressedText) in C:\vcprojs\AIWar\Framework\Compressor.cs:line 49
10/25/2009 3:59:18 PM

Setup:
  Tried opening a firewall port, but the connection hardly ever works through that.  Using Hamachi.  Connection seems to work about 1/2 the time.

Server: Windows Vista 64, 4GB ram, Core2 Duo [email protected], ATI Radeon HD 4870 X2, Fios Connection
Client: 2yr old box running Vista(32-bit) 3GB ram (don't know the rest of the specs), (unknown internet connection)
Also used a Windows XP laptop w/ 2GB ram.

Tried all combinations of server/client.

Works fine on the Lan, or via Hamachi between the XP laptop and Vista 64 computer.

Tried Uninstalling/Reinstalling .NET 3.5 SP1

What can I do to get you some trace files or such?  We're willing to run debug versions.  

Thanks!
-Jeff
« Last Edit: November 17, 2009, 01:58:22 pm by x4000 »

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #1 on: October 30, 2009, 08:38:03 pm »
Hi there,

Sorry for the troubles it sounds like you are having there.  Can you try one thing?  Under Settings, Game, Network Sync Performance, can you try checking the box for "Skip Batch Network Sending" to see if that solves it for you?  Check that box on the host and the clients, and then clear the error logs that are there, and let's see what happens.  The issue you are describing is not a known one; I haven't heard of that specific case before, but I think it may be related to another recently discovered issue (for some people), for which the above fix should work 100% of the time.

If that doesn't do it, then let's see what else we can figure out.  Knowing the speed of the two connections would be helpful:  http://www.speedtest.net/

Hope that helps!
Have ideas or bug reports for one of our games?  Mantis for Suggestions and Bug Reports. Thanks for helping to make our games better!

Offline jeffhoye

  • Newbie
  • *
  • Posts: 3
Re: "Waiting for players..." when move ~150 ships
« Reply #2 on: October 30, 2009, 09:52:51 pm »
Checked the box on both sides and deleted those error files.  Still failed quickly.  Error files weren't recreated after the failure.

Here's the speedtest:
Server: http://www.speedtest.net/result/607769961.png
Ping 8ms, DL:26(Mb/s) UL:16

Client: http://www.speedtest.net/result/607771688.png
Ping 57ms, DL:13, UL:2


Then we're trying to see what our loss rate is like (this is over hamachi):

Looks like the latency can get to 1/3 second.

$ ping -f (anon) 1340 
PING (anon): 1340 data bytes
.........................................................................
---- (anon) PING Statistics----
2329 packets transmitted, 2256 packets received, 3.1% packet loss
round-trip (ms)  min/avg/max/med = 64/256/351/322

Any ideas?
-Jeff

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #3 on: October 31, 2009, 11:12:44 am »
Those speed tests look plenty fine, should be no issues there based on speed, especially not with so few units in a group causing an issue (it should work fine with 1500 ships being moved at a time with that, let alone 150).

Okay, so I've been thinking about this a lot, and poked around in the code some, too.  It sounds like potentially your router is choking on the data -- you said this works fine across the LAN, right?

When the game is first connecting and your Skip Batch Send is turned off, I'm really surprised it doesn't also choke then, because there's even more data being passed across.  That sort of discrepancy would normally make me look for a code issue, except that you aren't getting an error message of any sort (which means that the game has already handed off the data by the time that data gets lost or whatever is happening), and the same build is not exhibiting this for the hundreds of other players currently running it.

So, that makes me wonder about QoS logic on the routers, or something of that nature. My first look would actually be at the network cards, but you say it works over the LAN. It's still possible that they are the problem, and that what you are experiencing is not actually a dropped connection, but just a severely delayed one. If it's not giving you the "Reconnect to Server?" message on the client, that would also be in support of that.

Generally speaking, the first thing I would do with that sort of thing would be to make sure that the network drivers for the cards on both computers are up to date.  However, the fact that it works better through Hamachi compared to through your raw routers is again pointing me to perhaps some sort of issue on the router itself.  You might take a look at the router configuration and see if there is some sort of application-level content filtering going on, or if it has QoS services turned on (common when on a VOIP-enabling router, like those from Vonage).  If so, you might try turning those features off to see if that helps. 

You might also try looking to see if there is a later firmware version for the routers (often that is simple to update, but there may be risks that the manufacturer will inform you of).  Sometimes that can make for an improvement of your networking in general, as bugs do happen even on router firmware, and that would potentially resolve the issue here if it's related to outdated router firmware.  I've never heard of that issue happening with AI War before, but I have heard of it being an issue with certain other games and software.  Ironically, in the links for the first two games there, router firmware updates didn't solve the problems, but often it does, and hopefully it will here.

Additionally, if you want to try some more bug testing to really see what is going on, then hit Shift+F3 when you are in the game (on both the client and the host), and it will give you a bunch of information about networking specifically for AI War -- dropped packets, resent, etc.  You can look at it in realtime while the game is in its "Waiting for players" state, and see what is actually happening.  Is it choking on resends, waiting for some response that is never there, or something else entirely, for instance.

Hope that helps!

Offline jeffhoye

  • Newbie
  • *
  • Posts: 3
Re: "Waiting for players..." when move ~150 ships
« Reply #4 on: October 31, 2009, 12:46:24 pm »
Sweet.  I recorded a video of the network info from the server's perspective.  The "Waiting for players..." problem starts at 1:37 when the withheld message count starts climbing.  Let me know what you think!  You may want to view in HD mode to see the numbers more easily.

http://www.youtube.com/watch?v=oIKpJrm9Q0g

Thanks!
-Jeff

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #5 on: October 31, 2009, 03:53:30 pm »
That video was incredibly helpful in at least understanding what is happening, if not yet why -- thanks!

Okay, so here is what is happening -- in the network layer, a packet is getting lost, and then never getting to the destination, so then all of the future packets are withheld.  To give an example of what I mean, suppose we have a message A, that gets split into 5 parts for sending (this is at the packet level, so below the level of the game itself).  So, you have, as messages:

A1
A2
A3
A4
A5

Because of the unreliable nature of UDP, those might arrive in any order, and then they have to be reconstructed as the one big message "A," in the proper order.  Additionally, if some of them are dropped (packet loss), those will be resent, and then message A will be reconstructed.  So, again for example, in a normal case you might get this:

A2 (3ms)
A3 (3ms)
A5 (3ms)
A4 (3ms)
<resend dropped packet> (3ms)
A1 (3ms)

Or something along those lines.  The network library (lidgren network) handles all of this, and basically waits until all those parts are collected, then hands them over to the game, where they are actually reconstructed.  In this example above, you would have 4 withheld messages until A1 finally arrives, and then it would reconstruct and move on.

Given the commonness of some packet loss, this whole scenario is utterly common and resolves itself in less than a quarter of a second, generally speaking.  What is happening in your case, for some reason, is that a huge number of packets are going into the withheld state, because they are still waiting on the resend and arrival of packets that should have come first (so, in other words, A1 never arrives, and everything else after that point just keeps queuing up).
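
To make the "withheld" state concrete, here is a minimal sketch of an in-order delivery buffer. It is only an illustration of the idea described above, not Lidgren's actual code; the class name `OrderedReceiver` and its fields are hypothetical, and each part A1..A5 is assumed to carry a sequence number.

```python
class OrderedReceiver:
    """Hypothetical in-order delivery buffer (not Lidgren's real implementation)."""

    def __init__(self):
        self.next_expected = 1   # sequence number that must be delivered next
        self.withheld = {}       # seq -> payload, received but not yet deliverable

    def receive(self, seq, payload):
        """Accept one packet and return the payloads that can now be delivered."""
        if seq < self.next_expected or seq in self.withheld:
            return []            # duplicate, e.g. a resend that was not needed
        self.withheld[seq] = payload
        delivered = []
        # Release every consecutive packet starting at the one we are waiting for.
        while self.next_expected in self.withheld:
            delivered.append(self.withheld.pop(self.next_expected))
            self.next_expected += 1
        return delivered


rx = OrderedReceiver()
for seq in (2, 3, 5, 4):             # A1 was dropped and has not been resent yet
    rx.receive(seq, f"A{seq}")
print(len(rx.withheld))              # 4 messages withheld while A1 is missing
print(rx.receive(1, "A1"))           # ['A1', 'A2', 'A3', 'A4', 'A5']
```

If A1 never arrives, `withheld` simply keeps growing, which is the climbing counter visible in the video.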

I've never seen this sort of thing with the lidgren library before, so I don't know exactly what the trouble is.  I will send a note to the author of that library and see what he thinks, what possible causes might be.  To my mind, there are a few possibilities:

1. Some sort of bug in the lidgren library (probably unlikely, or others would be reporting it more, but it could be an "edge case" of some kind).
2. Something wrong with one of the network adapters, as I'd noted.
3. Something on one of your two routers that "doesn't like" that A1 packet, and is filtering it out as malicious or whatever else (application-level content filtering on a firewall).
4. Some sort of QoS service on one of your two routers that is incorrectly managing the stream of packets for some reason (very strange that Hamachi would act differently).
5. Some sort of other routing bug, at the ISP level, or on one of the routers, or something (that a router firmware update on both the routers might fix for you, if they are not already up to date).

My suggestion would be to at least look into possibilities 2 through 5, since this is something that -- so far at least -- seems isolated to you and your friend's case alone.  That doesn't mean another case isn't going to come up in another 5 minutes or 5 days or 5 months from now, but with these sort of edge cases, especially with the new data you've provided, my suspicion is something even below the network library layer, getting down to the network driver or further out on the network itself.

Meanwhile, I will contact Lidgren and see if he has any thoughts on the possibility 1 above, since this may prove a useful addition to his FAQ about the library, anyway.  Being that it's the weekend, and that I have not been in communication with him in the past, I don't know what sort of response time to expect -- I'll keep you up to date when I hear anything, though.  Thanks for your patience!

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #6 on: November 01, 2009, 12:08:54 pm »
I've been emailing with the author of the network library, and he had this to say initially:

Quote
It's hard to say what the problem is; but if withheld packets are growing, it means either an earlier packet is not coming thru for some reason, or, possibly more likely, that an acknowledge isn't coming thru in the other direction.

Let's assume A is communicating with B; and suddenly the host at A is starting to queue up withheld messages. This could mean that B has lost the ability to successfully send messages (such as acknowledges) to A, but A can still send messages to B. It will mean tho, that both peers will time out their connection after a while. This may be the result of one host switching IP or losing router information when behind NAT.

It should be easy to check if this is really the case (unidirectionally failure in the "link") - if other packet is coming thru it's not the case.

Any kind of exception in the library could also be a problem; skipping the acknowledge part or something - but the log should show if this is the case.

So, this pretty much lines up with what I was thinking, except that it could also be that ACKs from the client are not making it back to the server (for the transmission to be complete, the packet has to not only be sent, but it has to be acknowledged as received, given that this is an "unreliable" protocol).
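
As a rough sketch of the acknowledgement side of this, assuming a simple "resend until acknowledged" scheme: the sender below keeps every message until an ACK for it comes back, and reports which ones are due for a resend. The class name `ReliableSender` and the 0.25-second resend delay are made up for illustration and are not taken from the Lidgren library.

```python
class ReliableSender:
    """Hypothetical sender that resends messages until they are acknowledged."""

    def __init__(self, resend_after=0.25):
        self.resend_after = resend_after   # seconds to wait for an ACK (assumed value)
        self.unacked = {}                  # seq -> (payload, time last sent)
        self.next_seq = 1

    def send(self, payload, now):
        """Record a newly sent message and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, now)
        return seq

    def ack(self, seq):
        """The other side confirmed receipt, so stop tracking this message."""
        self.unacked.pop(seq, None)

    def due_for_resend(self, now):
        """Return sequence numbers whose ACK is overdue and restart their timers."""
        due = [seq for seq, (_, sent) in self.unacked.items()
               if now - sent >= self.resend_after]
        for seq in due:
            payload, _ = self.unacked[seq]
            self.unacked[seq] = (payload, now)
        return due


sender = ReliableSender()
first = sender.send("move 150 ships", now=0.0)
second = sender.send("build turret", now=0.0)
sender.ack(first)                          # the ACK for the first message arrived
print(sender.due_for_resend(now=0.3))      # [2]: the second ACK never came back
```

If ACKs stop arriving in one direction, every message stays in `unacked` and keeps being resent, even though the data itself may be reaching the other side.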

I checked out your video again to see if it looks like the client has stopped sending ACKs to the server, but the sent and received message/packet count keeps climbing for the server, even while the Withheld message count also climbs.  I'm still talking with the network library author, but so far this is pretty puzzling especially since you don't have any error messages logged.  Can you be sure and check on both the client and the server machines to make sure there is nothing at all being logged?  I know you checked the server, but if the client is the one with some sort of exception occurring, then there would be an error on the client side only.  Be sure to check the Game Data Directory in the Settings window on both machines, since the location of where those errors would be logged might be different between the two machines, especially if you are on two different OSes.

It is still looking like a router is the most likely culprit, but it's hard to be certain of much at the moment.  I'll let you know when I hear back from the network library author again.

Offline Vade

  • Newbie
  • *
  • Posts: 7
Re: "Waiting for players..." when move ~150 ships
« Reply #7 on: November 02, 2009, 03:12:29 pm »
There is one very easy way to check whether your local router is causing the issue: connect directly to your modem (if possible) and bypass the router altogether. If the problem continues, get your friend to disconnect from his router and connect directly to his modem. If it still continues after that, it is probably a network card or ISP issue. Having used lidgren myself, I have never had this issue in any of my projects either, so I really doubt it is the library.

Offline inchoate

  • Newbie
  • *
  • Posts: 4
Re: "Waiting for players..." when move ~150 ships
« Reply #8 on: November 16, 2009, 07:30:43 am »
I'm seeing the same (or very similar) behavior here, between a full version host & trial-mode client. Both versions are whatever Steam is giving out at the moment. We have a savegame (about 45 mins in) that will usually hit the problem within about a minute of play. The host doesn't log any errors (just gets stuck "waiting for players"), but the other player sometimes gets the Base64 exception from the OP. The "Skip Batch" option doesn't help.

This is via a direct internet connection. The ADSL modem on the host side is running OpenWRT with an appropriate port-forwarding rule.

I'll attach a tcpdump capture of UDP traffic on the PPP-over-ATM interface of the modem. This capture was taken after the problem had happened and everything stopped. The 219.xxx side is the host. Looks like there might be a problem with fragmentation of larger messages, at a glance (see frame 24 - it's the first fragment of a datagram - but where's the second fragment?). I'll take a closer look tomorrow, time permitting (I need to check to see what's actually coming in on the ethernet side).

Is there any way to force the net library to use smaller datagrams? FWIW, the ppp interface has an MTU of 1478.
« Last Edit: November 16, 2009, 07:34:55 am by inchoate »

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #9 on: November 16, 2009, 09:50:34 am »
Can you please try updating to this prerelease?  http://arcengames.com/forums/index.php/topic,2287.0.html

The Base64 exception should be solved by that prerelease, although I do not think that it is really related to your Waiting For Players message.  But it's good to be sure.

The fragmentation of larger datagrams, which is the real issue, is something that seems to happen on just a few network cards and/or routers.  In this thread, for instance, the same problem (different trigger) was solved simply by using a direct connection instead of Hamachi.  Obviously, that isn't your issue here, but I think the general advice from the other thread is the same: make sure your network drivers are as up to date as possible, same with router firmware, on both sides.  It sounds like in your specific case the host network card may indeed be the issue, if the entire datagram is not even making it from the host to the host router.  So perhaps a simple network card driver update on the host machine would solve yours definitively.

I'll talk to the network library author about ways to force the library to send multiple, smaller datagrams, but that would be a code change on my side rather than something you could toggle.

Hope that helps!

Offline inchoate

  • Newbie
  • *
  • Posts: 4
Re: "Waiting for players..." when move ~150 ships
« Reply #10 on: November 16, 2009, 06:42:06 pm »
The missing fragment is actually a red herring, sorry: I was filtering with "udp port 32320", which misses the 2nd and subsequent fragments because they don't contain UDP port info. The second fragment (with the final 10 bytes of data) did actually get sent. I'll attach an updated capture.
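
For anyone wondering why a port filter misses the later fragments, here is a small sketch that inspects the flags/fragment-offset field of a raw IPv4 header. Only the fragment with offset 0 carries the UDP header, and therefore the port numbers; the function name `describe_ipv4_fragment` and the hand-built headers are hypothetical, for illustration only. (In a capture, a filter along the lines of `udp port 32320 or (ip[6:2] & 0x1fff != 0)` should also pick up the non-first fragments, though it will match fragments of other traffic as well.)

```python
import struct


def describe_ipv4_fragment(header: bytes) -> dict:
    """Inspect the flags/fragment-offset field of a raw IPv4 header."""
    # Bytes 6-7 hold 3 flag bits followed by a 13-bit fragment offset.
    (flags_and_offset,) = struct.unpack("!H", header[6:8])
    more_fragments = bool(flags_and_offset & 0x2000)   # the "MF" flag
    offset_bytes = (flags_and_offset & 0x1FFF) * 8     # offset is in 8-byte units
    return {
        "is_fragment": more_fragments or offset_bytes > 0,
        "has_udp_header": offset_bytes == 0,           # only the first piece has ports
        "offset_bytes": offset_bytes,
    }


# Two hand-built 20-byte headers: only the fragment field is filled in.
first = bytes(6) + struct.pack("!H", 0x2000) + bytes(12)       # MF set, offset 0
second = bytes(6) + struct.pack("!H", 1456 // 8) + bytes(12)   # MF clear, offset 1456

print(describe_ipv4_fragment(first)["has_udp_header"])    # True
print(describe_ipv4_fragment(second)["has_udp_header"])   # False
```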

The host's ethernet MTU is 1500, so fragmentation is happening in the router, not on the host.

I wonder if the problem is that the path from the client to the host has a path MTU of 1500 or more all the way up to the ADSL link, and they're getting dropped instead of fragmented at that point (which would indicate an ISP problem). I'll see if I can get a capture from the client side to check that.

I'll try out the prerelease. I'll also try an MTU of 1500 on the PPP side temporarily to see if that helps.

But either way, it'd be helpful to avoid fragmentation if possible. It has a history of being poorly supported in many places (read: ISPs that think that everyone uses only TCP and 1500-byte MTUs). It's also less efficient: assuming that you break up large messages into several datagrams to avoid fragmentation in the first place, then picking the wrong size (as in this case) actually doubles the number of messages that get sent. Maybe dropping the UDP payload size makes sense? Common MTUs are 1500 (ethernet), 1478 (PPPoA), and 1454 (PPPoE), which imply UDP data payload sizes of 1472, 1450, and 1426 bytes respectively. Set it to 1400 and be safe?
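
The arithmetic above can be sketched as follows, assuming a 20-byte IPv4 header with no options and the 8-byte UDP header. The function names are hypothetical, and the 1458-byte payload in the last line is just an example size chosen to show a tiny trailing fragment.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
UDP_HEADER = 8     # bytes


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - UDP_HEADER


def ip_fragment_sizes(udp_payload: int, mtu: int) -> list[int]:
    """IP payload bytes carried by each fragment when a datagram crosses a link."""
    remaining = udp_payload + UDP_HEADER       # the UDP header travels in fragment 1
    per_fragment = mtu - IPV4_HEADER
    if remaining <= per_fragment:
        return [remaining]                     # fits, no fragmentation needed
    # Every fragment except the last must carry a multiple of 8 bytes.
    aligned = per_fragment - (per_fragment % 8)
    sizes = []
    while remaining > per_fragment:
        sizes.append(aligned)
        remaining -= aligned
    sizes.append(remaining)
    return sizes


for mtu in (1500, 1478, 1454):
    print(mtu, max_udp_payload(mtu))           # 1472, 1450 and 1426 respectively

print(ip_fragment_sizes(1400, 1478))           # [1408]: one packet, no fragmentation
print(ip_fragment_sizes(1458, 1478))           # [1456, 10]: a second, tiny fragment
```

A payload only a few bytes over the limit still costs a whole extra packet, which is why a conservative size like 1400 looks attractive.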

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #11 on: November 16, 2009, 06:56:35 pm »
Thanks for this, your information is very useful, and clearly you know a lot more about low-level UDP than I do, to be honest.  In looking more at the network code, I see that the MTU is set to 1459 as a default for some reason.  I'll lower it to 1400 in the next build, which should be out tonight, and we'll see what happens.  If that doesn't simply solve the issue, then I'll make MTU a configurable value so that people can adjust it if they feel they have to.  This sounds like it might be the problem, and might cause performance issues in general depending on the ISP/router in question, though.  In most cases this has not been an issue at all, obviously, but it might explain this specific case.  I'll post again when the update is out (probably in around 4-5 hours).  Thanks for all the sleuthing you've obviously been doing, this is super useful data!

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651
Re: "Waiting for players..." when move ~150 ships
« Reply #12 on: November 16, 2009, 08:49:32 pm »
Here's the version with the MTU lowered a bit:  http://arcengames.com/forums/index.php/topic,2312.msg14485.html#msg14485

Hope that does it for you!

Offline RCIX

  • Core Member Mark II
  • *****
  • Posts: 2,808
  • Avatar credit goes to Spookypatrol on League forum
Re: "Waiting for players..." when move ~150 ships
« Reply #13 on: November 16, 2009, 10:13:03 pm »
I'm a bit of a n00b when it comes to networking, so what's MTU?
Avid League player and apparently back from the dead!

If we weren't going for your money, you wouldn't have gotten as much value for it!

Oh, wait... *causation loop detonates*

Offline x4000

  • Chris McElligott Park, Arcen Founder and Lead Dev
  • Arcen Staff
  • Zenith Council Member Mark III
  • *****
  • Posts: 31,651