Author Topic: Multi-core CPUs are not very well utilized for large fleets of ships. (Read 12676 times)

x4000 · « **Reply #30 on:** February 17, 2010, 09:03:01 pm »

Okay, commutativity -- I'm much better at programming than math. To be honest, the Quake 4 method of multiplayer is something that I find intimidating at the moment. It's the "simpler" approach, but it's not what I know, so it's more intimidating. That only works for games with a fewer number of agents, and revolves around constantly-diverging simulations that are then synced in some manner a few times per second. Thus you wind up with predictive movements, and situations where a player in an FPS game who has high lag might seem to jump around your screen, etc. In that sort of environment you don't have to worry about desyncs, though, since the simulation data is small enough to be continuously resynced so frequently -- but that's why you never see ghosting or jumping of ships in an RTS game, is because the networking model is so different from action games.

Mylon · « **Reply #31 on:** February 18, 2010, 04:00:51 pm »

What about having each system run in its own thread? There will be time where I'm not even in combat and the game runs at a crawl, from the devourer golem or whatever else is going on. Or if I'm trying to defend multiple systems at the same time this would help speed things up.

x4000 · « **Reply #32 on:** February 18, 2010, 04:16:29 pm »

Quote from: Mylon on February 18, 2010, 04:00:51 pm

What about having each system run in its own thread? There will be time where I'm not even in combat and the game runs at a crawl, from the devourer golem or whatever else is going on. Or if I'm trying to defend multiple systems at the same time this would help speed things up.

I noted about this one before, higher up. Generally that's not going to provide much of a benefit, since it tends to be intra-system stuff that eats up the most CPU. I don't know what version of the game you are running, but if you're playing solo and not in combat, the dyson really should not add much. If your allies are in combat, that is the same as you being in combat, in multiplayer.

The thing about this is that, yes, the ability to split via planet would be of some help, in some situations, but it would be an incredibly minor help in the vast majority of cases. And given the massive overhead of splitting out the threads (either lots of locking issues, or a message-based queue to send data between threads), this would add a baseline overhead that is almost certain to exceed whatever incremental savings you get in most cases.

In other words, it would run significantly faster if you had two huge battles going on on two different planets at once (assuming you have multiple cores), but otherwise it would run notably more slowly in any other situation. And for those on single cores, it would simply run more slowly, period. And that's after we'd have to rejigger the Randoms and such to be per-planet, which would be no mean feat with ships moving between planets, etc.

TheSilverHammer · « **Reply #33 on:** February 19, 2010, 09:55:46 am »

I do not think it needs to be that complicated. You can easily parallelize anything and synchronization isn't too hard if you follow some basic steps. Lets take an abstract example:

You have a function called DoStuff(). This takes a large chunk of data, does something with it and then returns with the answer. Currently at some point our simulation halts or begins to run slowly because DoStuff() is taking longer and longer to return. This is important to remember, the rest of the sim is waiting, on DoStuff().

So lets make DoStuff() multi-threaded / cored. We will now create another function, to represent what happens inside of DoStuff(), called DoSomething(). It represents the actual work being done and DoStuff() is now a wrapper for DoSomething();

When we initialize our program, DoStuff() does its OneTimeInit() which includes setting of N threads for N Cores. It only does this once, so we do not keep making and unmaking threads. DoStuff() creates N instances of DoSomething(), again done at Init time.

Now when DoStuff() gets it's chunk of data it needs to do work on, it subdivides it into N slices, one for each DoSomething() instance. It hands each chunk to an instance of DoSomething() and each thread begins doing work (presumably on a different core if the windows task scheduler is not brain dead).

Now DoSomething goes into a loop and waits (or you can use callbacks or synch objects or whatever you want). It waits for each thread to say it is done. If thread 4 finishes first, it just waits until everybody is done. When everyone is done, DoStuff() reassembles the data (which may not be needed depending on how you set up the data for DoSomething() )and returns with its answer.

Nothing outside of DoStuff() has to be aware the internal operation was done in parallel. Everyone was waiting on DoStuff() before to finish, so they just wait for the threads to synchronize and finish their work. No other code needs to change. Even the DoSomething() code will only need minor changes depending upon how it was written.

Ill grant you that there is another case which can be problematic, but isn't as bad as it looks. If the operations of DoSomething() is dependent upon what other instances of DoSomething() are doing. An example of this is a sort.

Lets take a bubble-sort as an example. If you were to break your data up into 4 chunks (4 cores) and use this methodology, you would end up with 4 sorted lists although you would NOT end up with the "Big list" sorted. This looks daunting, but it is not. It is easily resolved.

(new post due to this forum software bugging if posts get too long [ maybe it is an IE 8 issue))

TheSilverHammer · « **Reply #34 on:** February 19, 2010, 10:06:20 am »

Lets say we have a list of numbers, 1 to 100 in reverse order and we want to bubble sort using 4 cores. We take our list and hand it to each of our bubble sort instances. We will say this list is a single chunk of memory for efficiency's sake. We pass in the indexes of the area of this list each instance of our bubble-sort is responsible for. Only due do the nature of the bubble sort, we know that each list can not operate in a vacuum.

To this end, we have an overlap in data. Sort 1 does 1 to 25, Sort 2 does 25 to 50, etc... Now we have Sort 1 and 2 sharing element 25. All we need to do is be smart and when we are operating on an end point element, such as 25 or 50 if you are "Sort 2", you need to lock your end-points to make sure that Sort 1 isn't doing something with element 25 while Sort 2 is. You do not need to lock the rest of the elements because that is inefficient and we know they are safe since no one else will be touching them.

Now you just proceed as normal, as stated above and you will end up with "The Big List" sorted 4x faster than with a single core.

I will grant you that I do not know what algorithm you are using that is causing the bottleneck, and perhaps it really is very complex. I can't think of anything that to complex to figure out how to make it multi-core friendly. However, I would find such a thing to be a fun challenge to figure out how to parallelize.

keith.lamothe · « **Reply #35 on:** February 19, 2010, 10:18:34 am »

TheSilverHammer, do you understand that the main issue is that each player's computer needs to produce exactly the same simulation results, and that two simulations performing even minor operations in a different order results in desync?

Black · « **Reply #36 on:** February 19, 2010, 10:19:33 am »

Your outlook is excessively optimistic to the point of naiveté. Parallelization of general algorithms is not only a highly unsolved engineering problem, it is theoretically unsolved. There is no theory that tells you how to parallelize any large class of algorithms. Your example and line of reasoning only works for the most straightforward of data manipulation algorithms. I would bet a fair sum of money that it would not work for AI war given its current architecture.

http://en.wikipedia.org/wiki/NC_(complexity)

NC !=(?) P

x4000 · « **Reply #37 on:** February 19, 2010, 10:44:22 am »

What Black and Keith said. And, also, you have the performance cost of either passing data, or locking data, to also overcome. It's not that there is one problem with the parallelization approach for a game like this. There are a basically three (that I know of, but possibly more). And any one of those three are showstoppers:

1. Simulation results must be deterministic, regardless of computer.
2. Splitting out the parts algorithms that have cumulative results does not work in a general sense (what Black said).
3. Memory contention issues abound when you have a lot of threads needing to access and change many points of data on the same object. Abound to the point of sucking away whatever performance gain parallelization is likely to give.

Black · « **Reply #38 on:** February 19, 2010, 11:00:45 am »

All that said, how many per-frame tasks actually depend on each other?

Does collision detection depend on anything but the initial ship data at the beginning of the frame? If not, then with a modification of architecture of the engine, they could possibly be split up into two threads. If you're doing any sort of interpolation within the frames, though, this will not be compatible. But collision detection already is asynchronous IIRC from one of your articles, as it is only done in chunks per frame. Having a thread handle it on its own seems to be a possible avenue of parallelism. Obviously there is a problem with syncing in multiplayer, but that does not seem insurmountable if you focus on collision detection only.

if every player only did collision detection for themselves, with the host doing collision detection for the AI as well, then the results can be passed the same way as player issue commands are to the other players. Since no two players are computing the same collision detection data, there is no issue of synchronization. The obvious downside is bandwidth consumption as collision detection moves are bound to be as numerous if not more numerous than regular player issued moves.

But that can only be concluded by testing, and I'm sure that the Arcen programmers, who have shown the ability to make deep and clever optimizations, could cut down the bandwidth consumption to acceptable levels.

There is only one drawback to exploring this option that I can think of, and that is not to diminish how weighty this drawback is. In fact, it is so significant that it essentially rules out this experiment without second thought. Implementing any sort of attempt at multithreading, such as the one I conjectured in this post, taks man-hours. Arcen really can't afford to waste man-hours; they have to be allocated as efficiently as possible. It is probably far more important for the focus to be on Tidalis development than potentially gutting AI War's engine to make way for a minor parallelism bonus.

But ideas are always good. Perhaps this line of thought *is* valid, and perhaps it can be generalized to other aspects of the game loop. It's probably not worth thinking about, though, considering the current priorities of the development team.

keith.lamothe · « **Reply #39 on:** February 19, 2010, 11:25:37 am »

Yes, this is a very valid and worthwhile line of discussion, as it is good to know "where we would try to start" if further multi-threading became more of a priority, and certainly not all parts of the computation work are equally problematic (though nearly all of it is at least somewhat problematic, as noted before).

I just ask that folks actually listen when we explain the details of why our particular model does not yield well to standard parallelizing methods, etc. Yes, I know bubble sort can be changed to something much more multi-threadable. Yes, I know that generally this and other methods can be very helpful with harder problems. But this kind of multiplayer game is a different story. As long as everyone actually listens, this is fruitful to talk about.

NickAragua · « **Reply #40 on:** February 19, 2010, 11:32:46 am »

Yeah, if you're going to write your application to make use of multiple cores, you have to do so right from the design phase, before you start writing any code. If you decide to do that after the fact, you might as well just make a new game (ding ding ding), since, as others (game devs, Black) have pointed out, changing the game to take advantage of more cores than it already is would basically mean ripping out the whole engine and start over. Ever do a full engine rebuild on a car? Well, this is just as complex, if not worse.

x4000 · « **Reply #41 on:** February 19, 2010, 12:08:34 pm »

Yes, no matter what way you cut it, this is not a priority for now -- even if we figured out exactly what to do, and it was an obvious win, that wouldn't happen until later this year when we're working on the second expansion (any updates, of course, would be free for all players, as per usual). And it will be a guts replacement, as NickAragua says.

As usual, I am letting myself get way to sucked into a discussion that I shouldn't let take up so much of my time by this stage, so I'm going to say one last bit and then get back to work on Tidalis for now. But, others feel free to keep discussing. A couple of points:

Why The Collision Detection Is Dependent On Other Stuff
In answer to Black's question, which was very well laid out, I'm going to try to explain something very complex in a concise manner. For those following along, please try to read this very carefully and really understand what I'm saying. It's more complex than it seems at first blush.

Okay, to sum this section up as concisely as I can: the simulation must give a deterministic result, and thus everything -- everything must happen in the exact same order on every computer, regardless of any other factors.

In other words, if Ship A collides in Frame 1, it is using the game state from Frame 1. If Ship A collides in Frame 2, the game state has likely changed in subtle ways. This may lead to any of the following, which causes a desync:

- Different random values coming up (completely assured).
- Ships being in slightly different positions, and so colliding slightly differently (not all that common, given the precision of time we are talking about here, but will happen frequently enough to cause desyncs at least once an hour, if not more frequently).
- Ships having died, gone through wormholes, or whatever other status changes that cause them to not collide or collide differently, and which then cause desyncs.

Now, in thinking about this, I hope everyone will extrapolate out to the rest of the program. This applies to targeting, to... well, there isn't anything it doesn't apply to. Now, hold on, I know that there's a subsequent argument brewing, and I'll get to that one:

Splitting Tasks Per Client, Or Taking Things Out Of The Simulation
Okay, so Black's argument was that collision detection could be lifted off the main thread and done on worker threads. After all, it only depends on all the ships on a planet and their positions and any statuses that would cause them not to collision-detect, right? I say that non-facetiously; compared to the targeting requirements, that's no data at all.

Here are the problems:

1. For this to be deterministic, the working data has to be exactly the same on all the clients when it is passed over. This essentially makes one thread completely dependent on the other, which introduces all-new bottlenecks. Something might could be done with this, but it wouldn't be all that pretty and performance would likely be questionable even with so (comparably) few data points for this.

2. If you make it per-client, then this is something that is no longer part of the simulation. Literally, in the same sense that the AI Thread that currently exists in AI War is not part of the simulation. When you take something outside of the simulation, that means it no longer has to be deterministic, which is great because it lets the main simulation keep chugging away without need to wait on the main thread.

However, #2 does have downsides, most notably command lag. If something is outside the simulation, then the only way that it can impact the simulation is by issuing internal "game commands" that get sent to the server, then scheduled for execution on the next turn that the server sends to all clients. Thus everybody still winds up integrating those commands into the simulation at exactly the same time. There is a 400ms round-trip time sending stuff like this from the clients to the server and then back. That may not sound like a lot, but in terms of collision detection or auto-targeting, that's a huge amount of lag.

Case in point, this is the normal lag of the AI in making decisions -- except that it is even worse for the AI, because I only allow it to send 200 game commands per game turn (200ms), in order to keep it from bogging down the network with all its commands (which it definitely does in the absence of that speed cap). We compress, we combine, etc, but there is still no way to get around the volume of data that has to send. In terms of the AI Thread, this sort of added lag is actually good, because it makes the AI seem a bit more human-like. It might take it a couple of seconds to respond to something a human does, which seems fair, right?

But: that is only fair because, like human ships, the AI ships have autotargeting, FRD mode, etc, which can make them react in a more automated way prior to getting their real commands from the AI overlords (same as your ships act in an automated way prior to getting their commands from you). Collision-detection, compared to the autotargeting, is really almost nothing of a load. Yes, collision-detection is probably the second-highest load of the simulation thread, but it's so much less than autotargeting that it's not even in the same class of load.

Okay, so take autotargeting onto the AI thread, then, right? That should be possible, and easier than one might expect since all the needed data is already on that thread, thus avoiding the addition of any of that nasty added syncing-related performance load I've been talking about. In fact, I went through a lot of work back in late November converting the game to do this, but the results were so disastrous that I reverted it. Here's what happened in my fully-functioning tests:

1. All autotargeting was moved to the AI thread, which meant that until ships got a command from the AI thread, they would just sit there. Not a problem if the interval between AI thread data sends is 200ms, right? That's what I thought I could work off of.

2. Because of all the extra commands being sent for autotargeting purposes, however, and because of the (very much needed) 200-commands-per-game-turn cap, this caused an incredible bottleneck, though. Often autotargeting commands wouldn't arrive for 10-15 seconds or more. This meant that ships that happened to luck into getting earlier autotargeting commands had an incredible advantage, as they could basically completely wipe the other side off the face of the map before the other side even knew to fire.

3. To experiment with countering this on just my local machine -- single player, now, not even multiplayer -- I took the AI thread command cap off. The flood of data messages that were needed in order to run autotargeting at anything approaching full speed was such that it made the game all but unplayable even in really lightweight scenarios that presently the game laughs at.

Could I have made optimizations to the above approach I was experimenting with? Undoubtedly. But that doesn't escape the fact that there is simply too much data to push through the network pipe like that, and so you can't take autotargeting out of the simulation and expect a game at all like what the current game is. Anything that talks about splitting stuff per-client, just running some activities on one machine versus another, is going to fall under this category. Basically, if you add in voice chat (which is common), plus all the commands that players tend to issue along with what the AI issues, the game is using about as much network bandwidth right now as I care for it to for minimum system requirements purposes.

So, in the end, that takes us back to the main kind of questions, which is basically splitting actual simulation activities amongst multiple threads that are all "part of the simulation" and thus expected to give the same result on every machine. And for an explanation of all the various pitfalls of an approach of that sort with this type of game, please see all the stuff I've already written in this thread, along with the "Why The Collision Detection Is Dependent On Other Stuff" section above.

Black · « **Reply #42 on:** February 19, 2010, 12:12:34 pm »

Well the numbers don't lie, so I think that settles the question fairly conclusively

x4000 · « **Reply #43 on:** February 19, 2010, 12:23:34 pm »

Personally I do think there is a solution out there somewhere, but I think it's going to be something really non-obvious and engine-specific. It will have to be in splitting the simulation into multiple threads that somehow give a deterministic result -- given the constraints of everything else, that's the only possible answer.

I guess that sounds counter to everything I've been saying before in this thread, but I don't mean it to be. I've always thought that something like this might be something that is possible, but my points are mostly:
1. This is incredibly nontrivial.
2. No-one else has ever solved this problem to my knowledge.
3. Someone who is not intimately familiar with the AI War architecture, or at least some other multiplayer RTS architecture to the point of having coded a significant portion of one, is unlikely to solve this problem. Because if it were that easy to solve from an armchair, then hundreds of really smart programmers at various studios would already have done so (and I like to think I would have, too).

AI War has already had a number of technical firsts in terms of unit counts, simulation throughput, and so forth. We push units through the simulation like FPS games push polygons through the GPU, and that's before you even get into all the AI firsts. So I have hopes, perhaps arrogant hopes, that I might be able to crack this problem at some point in the future, if I have an epiphany of some sort and then the time to experiment with it (read: not primarily being involved in development of a different game at the time). I have that sort of epiphany every so often, as people who follow the game's development know, and hence a lot of the performance gains over even what we'd already accomplished.

But this multithreaded simulation on is on a different sort of level from most of those things I have done in the past; it might even take upgrades to the general .NET threading models before it is possible. And, even before that, it's quite possible that simply pursuing single-thread optimizations to the algorithm will lead to better results for more people. In other words, I'd like to support single-threaded machines also, and so coming up with new clever ways to minimize the impact of the main simulation in complex scenarios is more of a priority because it benefits everyone regardless of hardware.

At any rate...

Remagen · « **Reply #44 on:** February 19, 2010, 07:46:42 pm »

Lol...

Every time I read this thread I lol; x4000 repeatedly explaining (in posts of increasing length) why he can't do much more multithreading, while legions of fans insist that it a simple matter of willpower.

I reject your reality, and substitute my own.

News:

Author Topic: Multi-core CPUs are not very well utilized for large fleets of ships. (Read 12676 times)

x4000

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

Mylon

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

x4000

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

TheSilverHammer

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

TheSilverHammer

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

keith.lamothe

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

Black

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

x4000

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

Black

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

keith.lamothe

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

NickAragua

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

x4000

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

Black

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

x4000

Re: Multi-core CPUs are not very well utilized for large fleets of ships.

Remagen

Re: Multi-core CPUs are not very well utilized for large fleets of ships.