Quick tips for how to avoid multiplayer desyncs from your modding code.

  • Chris McElligott Park, Arcen Founder and Lead Dev
Hello there!

AI War 2 is intended to automatically recover from desyncs, unlike the first game which died instantly in the event of one.  That said, keeping desyncs to a minimum is a good thing because otherwise players are going to see some pretty strange stuff like ships dying and coming back to life, or suddenly altering position, etc. 

And honestly, depending on the nature of the desync a mod introduces, people may come looking to me about it, because it may manifest "downstream" from your mod.  Think the domino effect, more or less.

So here are a few tips to make things easier on all of us!

Big One: Any Sort Of Randomization
If you're just doing straight out integer math or similar, like 1+1, every computer is going to return the number 2.  Equally important, however, is that when multiple machines are doing a call to a random number generator in some context (sim planning, sim execution, whatever), calling the equivalent of rand() should give the same number on all those machines!

We have set this up for you already, in that we have something called ArcenContext which gets passed around all over the place.  It has a Random on it called RandomToUse, and you can call Next() and NextBool() and so on on that.  If the ArcenContext has been passed into a method, you know a few things:

1. The RandomToUse is in the identical state on all the machines in a multiplayer game unless someone else caused a desync somewhere else.
2. The RandomToUse actually uses a better algorithm than is built into C# by default.  So you'll get a much more pleasing distribution.
3. The RandomToUse is a bit faster, as well, if I'm not mistaken.

If you ever use System.Random.Next() or something along those lines (the stuff built into C#), then you can be assured of a couple of things:

1. You just introduced a desync UNLESS this is in "long range planning" code, which is run only on the host and then communicates via GameCommands to the rest of the game.
2. You also got an inferior quality of random, to be honest. ;)
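
To make the difference concrete, here is a minimal language-neutral sketch, written in Python rather than the game's C#. Nothing in it is the AI War 2 API: `simulate_machine`, `SHARED_SEED`, and Python's `random.Random` are stand-ins I am assuming for illustration, with the seeded generator playing the role of a context-provided random like RandomToUse and the unseeded one playing the role of a rogue machine-local random.

```python
import random

SHARED_SEED = 12345  # hypothetical: stands in for state every machine agrees on


def simulate_machine(shared_seed, use_local_rng):
    """Roll five 'sim' numbers the way one machine would."""
    if use_local_rng:
        # Seeded from OS entropy: different on every machine and every run.
        rng = random.Random()
    else:
        # Same starting state everywhere, like a generator handed in by the sim.
        rng = random.Random(shared_seed)
    return [rng.randrange(1_000_000) for _ in range(5)]


# Two machines drawing from identically seeded generators stay in step.
machine_a = simulate_machine(SHARED_SEED, use_local_rng=False)
machine_b = simulate_machine(SHARED_SEED, use_local_rng=False)
print("shared generator matches:", machine_a == machine_b)

# Two machines each using their own local generator drift apart immediately.
rogue_a = simulate_machine(SHARED_SEED, use_local_rng=True)
rogue_b = simulate_machine(SHARED_SEED, use_local_rng=True)
print("local generator matches:", rogue_a == rogue_b)
```

The only thing keeping the first pair in sync is that both machines make the same calls in the same order, which is why an extra draw in one place shows up as a desync somewhere else.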

Big One: Changing The Sim Directly From The Wrong Place
Let's say you're on a background thread that is for planning out what ships will be grabbed by tractor beams.  You can feel free to read most information from the main simulation, but NO TOUCHIE! ;)

A variety of short-term planning threads are running all at once, all looking at the same data.  If one of them makes a change while they are in the middle of that, then you've introduced a race condition and also a desync.  Essentially, different threads will run at different speeds on different machines and different cores, and that's absolutely "random" in terms of how your processors are set up and how the OS is set up, and how they happen to allocate CPU time.

So on machine 1, you might have it where the change from the tractor beam thread happens BEFORE some movement planning thread logic, and thus the movement planning thread logic includes that change.  But on machine 2, it happens after, and thus it's using completely different data.  In reality, machines 1 and 2 would randomly be flipping back and forth with before and after, because there is absolutely no order to these things happening.  By design: if they were dependent on each other, then they couldn't be calculated independently.

So How Do I Change Sim Data?
If you need to change sim data, then there are a couple of methods to do that. 

From short-term planning threads, you can usually return some results in a struct or class that is specifically for saying "here's what I calculated."  The main thread will then synchronously take those results and integrate that into the sim.  This gives a deterministic result.
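
As a rough illustration of that pattern (again a language-neutral Python sketch, not the game's C# code), the planners below only read a snapshot and hand back a result, and a single main-thread step integrates those results in a fixed order. The names `plan_tractor_beams`, `plan_movement`, `run_sim_step`, and the dictionary-based sim are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_tractor_beams(snapshot):
    """Read-only planner: returns a result, never writes to the sim."""
    target = min(snapshot, key=lambda ship_id: snapshot[ship_id])
    return {"grab": target}


def plan_movement(snapshot):
    """Read-only planner: proposes new positions as plain data."""
    return {"moves": {ship_id: x + 5 for ship_id, x in snapshot.items()}}


# Fixed order in which results are collected and integrated.
PLANNERS = [("tractor", plan_tractor_beams), ("movement", plan_movement)]


def run_sim_step(sim_state):
    """Main thread: run planners in parallel, then integrate synchronously."""
    snapshot = dict(sim_state)  # every planner sees the same frozen data
    with ThreadPoolExecutor() as pool:
        futures = [(name, pool.submit(fn, snapshot)) for name, fn in PLANNERS]
        # Collected in PLANNERS order, regardless of which thread finished first.
        results = {name: future.result() for name, future in futures}
    grabbed = results["tractor"]["grab"]
    for ship_id, new_x in sorted(results["movement"]["moves"].items()):
        if ship_id != grabbed:  # a grabbed ship stays where it is
            sim_state[ship_id] = new_x
    return grabbed


sim = {"ship_1": 10, "ship_2": 25, "ship_3": 40}
grabbed = run_sim_step(sim)
print(grabbed, sim)
```

Because neither planner ever writes to `sim`, it does not matter which thread happens to finish first; the outcome depends only on the snapshot and the fixed integration order.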

If you are in a long-term planning thread, that's something that only runs on the host.  So any changes from there would only happen on the host and be an immediate desync if they affect the sim.  These need to communicate with all the players by a different method: creating and sending GameCommands.  There's a way for long-term planning threads to log GameCommands (look at the decollision planning for an example of that), and then the host picks those up later and transmits them to everyone for execution on a later sim frame.  This does mean a delay of about 200ms-400ms, depending on the exact timing of the threads relative to sim steps ending.  That's not something you can speed up or predict, so just be aware the delay is there.

These GameCommands are the same things that get created and queued when you as a player click some button in the interface or right-click with your mouse to tell units to move or attack somewhere, etc. 
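
The flow described above can be sketched roughly like this. It is a language-neutral Python illustration, not the real GameCommand classes: `Command`, `Machine`, `host_broadcast`, and the three-frame `COMMAND_DELAY_FRAMES` are assumptions made up for the example.

```python
from dataclasses import dataclass, field

COMMAND_DELAY_FRAMES = 3  # hypothetical: stands in for the 200ms-400ms delay


@dataclass(frozen=True)
class Command:
    execute_on_frame: int
    ship_id: str
    new_x: int


@dataclass
class Machine:
    name: str
    frame: int = 0
    sim: dict = field(default_factory=lambda: {"ship_1": 0})
    inbox: list = field(default_factory=list)

    def step(self):
        """Advance one sim frame, executing the commands scheduled for it."""
        self.frame += 1
        due = [c for c in self.inbox if c.execute_on_frame == self.frame]
        for command in due:
            self.sim[command.ship_id] = command.new_x
            self.inbox.remove(command)


def host_broadcast(host, machines, ship_id, new_x):
    """Host-only planning result: queued for everyone, applied by no one yet."""
    command = Command(host.frame + COMMAND_DELAY_FRAMES, ship_id, new_x)
    for machine in machines:
        machine.inbox.append(command)


host, client = Machine("host"), Machine("client")
machines = [host, client]
host_broadcast(host, machines, "ship_1", 42)

for _ in range(COMMAND_DELAY_FRAMES):
    for machine in machines:
        machine.step()

print(host.sim == client.sim, host.sim)
```

The key point is that the host never applies its own decision early: it waits for the scheduled frame just like every client does.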

The Interface Can't Directly Make Changes, Either
Another great way to cause a desync is to let the interface directly touch the sim.  That's something that would change only on the machine of the player who initiated the keystroke or mouse click, so of course multiplayer desyncs immediately.  To work around this, same as with long-term planning, the interface has ways to send GameCommands.  See how the lobby does it, or anything in the input handling classes.

Anything Sim-Affecting Can't Use ArcenSettings or GameSettings
This is a big one!  You and I are playing MP together, and our settings are not synced.  Our settings are our own personal preferences, and how mine are configured is personal to me, and vice-versa.

You'll note that there ARE some cases where GameSettings are used to drive behavior, though, and these aren't desyncs for one reason alone: they generate GameCommands from the local machine rather than directly touching the sim.  In other words, a behavior like "always build 5 engineers on all my planets" is something that is in GameSettings and is personal per player, but it never directly touches the sim.  It generates GameCommands as if the player had directly said "build 5 engineers" via more traditional means.  So all is ok.
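
Here is a small language-neutral sketch in Python (not the game's C# code) of why the first approach desyncs and the second does not. The function names, the `engineers_per_planet` key, and the two player names are all hypothetical.

```python
def sim_step_wrong(sim, local_settings):
    """BAD: the sim reads a per-player preference directly."""
    sim["engineers"] = local_settings["engineers_per_planet"]


def commands_from_settings(player, local_settings):
    """OK: the preference only emits a command, as if the player had clicked."""
    return [("build_engineers", player, local_settings["engineers_per_planet"])]


def sim_step_right(sim, all_commands):
    """Every machine applies the same commands in the same sorted order."""
    for kind, player, count in sorted(all_commands):
        if kind == "build_engineers":
            sim[player + "_engineers"] = count


# Each player's machine has its own, unsynced settings.
settings = {"player_1": {"engineers_per_planet": 5},
            "player_2": {"engineers_per_planet": 2}}

# Wrong way: each machine's sim consults only its own settings.
wrong = {}
for player, local in settings.items():
    wrong[player] = {}
    sim_step_wrong(wrong[player], local)

# Right way: commands from every player are shared, then applied everywhere.
shared = []
for player, local in settings.items():
    shared += commands_from_settings(player, local)
right = {}
for player in settings:
    right[player] = {}
    sim_step_right(right[player], shared)

print("wrong way in sync:", wrong["player_1"] == wrong["player_2"])
print("right way in sync:", right["player_1"] == right["player_2"])
```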

You CAN Freely Use World_AIW2.Instance.SetupStoredLongTerm
Or anything on the Factions.  This data is explicitly tied to the campaign being played, and is global to everyone sharing the game.  However, if you were to CHANGE any data in these, then that should either be something that is deterministically synchronous across all machines by nature, or sent by GameCommands if it is at the behest of one specific machine.

You CAN Do Anything Deterministically Synchronous From The Main Thread
Sky is the limit here, change away.  To be deterministically synchronous means that it's just part of the regular sim, on the main thread.  Various methods have a suffix of _MainThread or _OnMainThread in order to let you know that you can change the sim from in there.  As long as you aren't using a rogue random() call (Context.RandomToUse.Next() is still fine here!), or basing your data off of GameSettings or local player input directly... anything you do as part of your algorithm will be the same on all the machines simply by nature.

If you're not sure if you're on the main thread, you can always call Engine_Universal.CalculateIsCurrentThreadMainThead(), and that will return true if you are.  If you're not, then don't touch the sim.  This method is rarely needed, because method call trees tend to always be on the main thread, a short term planning thread, or a long term planning thread, but not all of the above.  So mainly this would be used if you aren't sure where you are.  You can also emit a stack trace to figure that out. ArcenDebugging.ArcenDebugLog( "whatever", Verbosity.ShowAsError ) will automatically print out a stack trace along with the word "whatever".
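
The general idea of a "which thread am I on?" guard looks roughly like the following. This is a language-neutral Python sketch, not the Engine_Universal implementation: `SIM_THREAD_ID`, `is_sim_thread`, and `set_ship_position` are hypothetical names, and the guard simply remembers which thread owns the sim at startup.

```python
import threading

# Hypothetical: record the thread that owns the sim when the program starts.
SIM_THREAD_ID = threading.get_ident()


def is_sim_thread():
    """True only when called from the thread that owns the sim."""
    return threading.get_ident() == SIM_THREAD_ID


def set_ship_position(sim, ship_id, x):
    """Refuse any sim write that does not come from the owning thread."""
    if not is_sim_thread():
        raise RuntimeError("sim writes are only allowed on the main thread")
    sim[ship_id] = x


sim = {}
errors = []


def background_planner():
    """Stands in for a planning thread that wrongly tries to touch the sim."""
    try:
        set_ship_position(sim, "ship_1", 99)
    except RuntimeError as exc:
        errors.append(str(exc))


worker = threading.Thread(target=background_planner)
worker.start()
worker.join()

set_ship_position(sim, "ship_1", 10)  # allowed: we are on the owning thread
print(sim, errors)
```

Failing loudly like this during development is usually far easier to track down than a desync that only shows up several sim frames later.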

Floating Point Math Is Bad But Unavoidable
Okay, so when we were absolutely bound to NEVER have a desync, we couldn't use float or double.  They just have too many small differences between machines, and those add up over time.  With them, desyncs are inevitable.  So we introduced FInt, which is a fixed-point stand-in for float.

FInt is great in that it will always give you the same result on any machine.  And for certain operations, it's pretty fast.  But it's also BAD in that it isn't hardware-accelerated, and it also has low precision.  So the various trig functions like Sin and Cos and so forth will give you sub-par results, and Sqrt is both slow and imprecise, and so on.  But it's the same on all the machines!

The game is meant to self-correct from small desyncs caused by floating point math.  And because of the nature of the sim versus what is visually displayed, it should probably usually handle this without players ever being the wiser.  So don't be afraid of floats or doubles if they're what is called for precision-wise.  If you're doing vector math, then go the extra mile and use System.Numerics, which is using SIMD under the hood to do 3 to 4 operations for the cost of 1.  UnityEngine.Vector3 is inferior in processing speed in that regard.

So... what is the message here on FInt versus float/double?

1. Basically, if it's something that doesn't need to be super precise in the first place (just 3-4 significant digits), and is stored for a long time, then FInt is your friend.  It makes sure that there are as few desyncs as possible.  So for most calculations in GameEntityTypeData, we use FInt, for example.

2. If something needs a lot of precision, like the trig required to do knockback or to cause a ship to be halted at the edge of a forcefield, then FInt just absolutely isn't precise enough.  You need to use float or double.  Yes this will probably cause a very minor desync every so often, but it won't be every time and the game will catch that and fix that for you, and that's just considered to be part of the natural game flow.
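
For anyone who has not worked with fixed-point numbers before, the tradeoff can be sketched like this. It is a language-neutral Python illustration and NOT how FInt is actually implemented: the decimal `SCALE` of 1000 and the helper names are assumptions chosen for readability, and the sketch only handles non-negative values.

```python
import math

SCALE = 1000  # hypothetical: three decimal digits; not FInt's real format


def fixed_mul(a, b):
    """Multiply two fixed-point values using integer math only."""
    return (a * b) // SCALE


def fixed_sqrt(a):
    """Integer square root of a fixed-point value, truncated to SCALE."""
    return math.isqrt(a * SCALE)


def fixed_to_float(a):
    """For display only; never feed this back into the sim."""
    return a / SCALE


one_and_a_half = 1500   # 1.5 stored as an integer count of thousandths
two_and_a_quarter = 2250

product = fixed_mul(one_and_a_half, two_and_a_quarter)
print(fixed_to_float(product))  # exact here, and identical on every machine

# Low precision: 0.001 * 0.5 is too small to represent and truncates away.
tiny = fixed_mul(1, 500)
print(tiny, "versus float", 0.001 * 0.5)

# Square root of 2.0 keeps only three decimal digits.
root = fixed_sqrt(2000)
print(fixed_to_float(root), "versus float", math.sqrt(2.0))
```

Integer math like this gives the same bits everywhere, which is the whole appeal, but the truncation in the last two examples is the kind of precision loss that makes it a poor fit for trig-heavy work like knockback.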

Conclusion
There are a lot of ways to introduce desyncs, and in fact in a lot of game companies in the past they had a programmer or two dedicated to just reviewing everyone's code to ensure that they weren't introduced.  A stray call to rand() relating to something like birds chirping in Age of Empires 1 caused them to tear their hair out.

It's not a mistake just made by novice programmers; it's something that's a kind of alien way to program unless you're accustomed to lock-step multiplayer networking already, or you're heavily into multithreaded code in a big way already.  Even if you ARE already heavily into that sort of thing, desyncs are an unfortunate fact of life and well after my 10,000 hours of work into AI War Classic I was still introducing them in funky occasional ways.  Keith too.  It just happens.

The good thing about the structure of AI War 2 is that it's meant to resist desyncs and repair itself from them, much like action/FPS games do instead of traditional lock-step games (RTS, 4X).  So there's a certain amount of leeway for us to make mistakes without the sky falling.  A tiny stray desync here or there will really mean next to nothing.

BUT!  If we just have desyncs constantly because people are not taking any care at all, then that's a lot of extra load on the network to fix all those mistakes, and the number of graphical "glitches" (things having health go up and down strangely, tractor beams suddenly switching what they are grabbing, ships suddenly moving to a corrected location, etc) will go up.  Think about how an FPS game gets during periods of high lag, where you see enemy and allied players glitching around the environment more.  That's kind of what would happen here.

So as best we can, avoiding desyncs is great because it keeps things clean and fast and doesn't lean on the desync-correction as a crutch.

Thanks for reading!
Chris
« Last Edit: May 22, 2019, 04:35:52 pm by x4000 »