Tips for Improving Reliability?

MH6 · March 17, 2021, 8:36pm

We have been using A*PP Pro for some years now and love it; however, our latest project is really pushing the limits of what it can handle. We need fairly complex realtime updates to the recast navmesh, and we keep running into problems where a user’s navmesh seemingly nukes itself and all AI stops responding. The AI derives from RichAI. We updated to 4.3.41 in hopes it would solve our issues, but it did not. We are on Unity 2019.4.10f1.

We have taken some steps to alleviate this, such as removing all usage of NavmeshCuts and switching purely to calling UpdateGraphs with the bounds of the areas we need to update. NavmeshCuts was nice and fast but would fail quite easily and randomly, frequently throwing errors such as “Too many perturbations”.

I’ve moved all graph updates into LateUpdate, and I pause all path requests during prescan and resume them during postscan in order to eliminate potential race conditions, which were again causing the path threads to terminate themselves randomly.

Still, we have encountered various bugs that cause the entire astar navmesh to stop working for that play session. One example is a start node being invalid when using FloodPath (FloodPath.cs:161, the first line in Initialize), which throws an exception and causes the pathfinding system to terminate itself rather than just returning a failed path and continuing on (the message says something similar to “this should never be reached”). We’ve also had to fix a number of weird issues, such as FloodPathTracer attempting to search graphs other than the one the flood path was constructed for, though it’s possible this is working as intended and just doesn’t fit our use case by default. It was easily fixed by adding this override to FloodPathConstraint:

		public override bool SuitableGraph(int graphIndex, NavGraph graph)
		{
			return path.startNode.GraphIndex == graphIndex;
		}

I’m still trying to narrow down the culprit as to why the navmeshes keep corrupting themselves–it is one of our most frequent bug reports–but if you have any leads or ideas on how I can improve the fault tolerance of Astar please let me know.

We’d be more than happy to pay for support if you offer it; just let us know your rates.

Thanks for your time

MH6 · March 18, 2021, 1:50am

We are also experiencing an issue with TriangleMeshNode.ContainsPoint, which again causes all pathfinding to stop working. I’m thinking it’s because the navmesh is updated and the TriangleMeshNode is deleted in between me calling the path and processing it. I will be putting in a fix for this, but it’d be nice if it’d just return an error or false instead of killing pathfinding.

IndexOutOfRangeException: Index was outside the bounds of the array.
Pathfinding.NavmeshTile.GetVertexInGraphSpace (System.Int32 index) (at Assets/AStarPathfindingProject/Generators/Utilities/NavmeshTile.cs:63)
Pathfinding.NavmeshBase.GetVertexInGraphSpace (System.Int32 index) (at Assets/AStarPathfindingProject/Generators/NavmeshBase.cs:179)
Pathfinding.TriangleMeshNode.GetVerticesInGraphSpace (Pathfinding.Int3& v0, Pathfinding.Int3& v1, Pathfinding.Int3& v2) (at Assets/AStarPathfindingProject/Generators/NodeClasses/TriangleMeshNode.cs:109)
Pathfinding.TriangleMeshNode.ContainsPointInGraphSpace (Pathfinding.Int3 p) (at Assets/AStarPathfindingProject/Generators/NodeClasses/TriangleMeshNode.cs:207)
Pathfinding.TriangleMeshNode.ContainsPoint (UnityEngine.Vector3 p) (at Assets/AStarPathfindingProject/Generators/NodeClasses/TriangleMeshNode.cs:200)

MH6 · March 19, 2021, 4:01pm

We are also using the graph serialization at runtime to save the navmesh when the player saves, and this sometimes freezes for up to two minutes (I call the same code as you do in the editor tools):

            var serializationSettings = new Pathfinding.Serialization.SerializeSettings();
            serializationSettings.nodes = true;
            // Save graphs
            var bytes = AstarPath.active.data.SerializeGraphs(serializationSettings);

            System.IO.File.WriteAllBytes(file, bytes);

It was sometimes locking up for two minutes for a graph that was only 500kb, so I’m guessing it was waiting on threads or something.

We’ve also had some improvements with stability by adding a GraphModifier and pausing all new pathfinding requests while graph updates are being processed. I’m pretty sure you already do this internally but it seems to help regardless.
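
A rough sketch of that kind of gate, for anyone reading along. PathRequestGate and its members are hypothetical names, not part of the A* Pathfinding Project, and the sketch assumes all requests come from the main thread, so it does no locking:

```csharp
using System;

// Hypothetical helper (not an A* Pathfinding Project class).
// Counts nested pauses so overlapping scans/updates do not reopen the gate early.
public sealed class PathRequestGate
{
    private int pauseDepth;

    // True while at least one scan or graph update is in progress.
    public bool IsPaused => pauseDepth > 0;

    public void Pause()
    {
        pauseDepth++;
    }

    public void Resume()
    {
        if (pauseDepth == 0)
            throw new InvalidOperationException("Resume called without a matching Pause");
        pauseDepth--;
    }

    // Runs the request only while the gate is open.
    // Returns false when the request was dropped because updates are in progress.
    public bool TryRequest(Action startPath)
    {
        if (IsPaused) return false;
        startPath();
        return true;
    }
}
```

The idea is to call Pause() from a GraphModifier's OnPreScan and OnGraphsPreUpdate overrides and Resume() from OnPostScan and OnGraphsPostUpdate (assuming those hooks are available in your version), and to route every path request through TryRequest so the caller can retry later.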

aron_granberg · March 19, 2021, 10:29pm

2 minutes!? That’s a looong time. How many nodes do you have in that graph?

MH6 · March 25, 2021, 6:17pm

It’s not a huge amount of nodes really. The filesize is only ~500kb or so. I think the long wait times are threading related personally. We are getting a lot of user reports of freezing during saves for large parks (specifically freezing during graph serialization). I’m thinking there is possibly a deadlock going on with the threads when this happens. Do you know an easy way I can force all the threads to pause so I can save the graph reliably?

Currently (as far as we can tell) most of our more game breaking bug reports stem from the navmesh. It seems to just stop working after a while. AI can no longer calculate paths, and graphs can no longer be serialized during save. We’ve just released a closed alpha so this is the first time we are getting large scale reports from a wide range of different hardware setups.

What we currently suggest to our players that encounter this issue is to rename their navmesh.data file (which is what we serialize the graphs to), because when loading a save if this file is not found we do a full scan (same code as when pressing the Scan button in the Astar editor). This usually fixes the navmesh, but not always.

Our game is a dinosaur zoo builder game, so the player has a lot of control over the level, which requires frequent realtime navmesh updates. Unfortunately after a while the navmesh tends to break, resulting in freezing when saving (specifically during graph serialization) and other odd issues. I know we are probably an unusual use case and most of your users don’t experience these problems.

You can see some example footage of our game here. The actual game level is just a flat grassy terrain with some trees, that entire park is player built.

I am trying to fix these issues on my own, but if you have any suggestions or ideas I’d greatly appreciate it. If you offer paid support I’d also be happy to pay you for a couple hours of your time. But I also understand if this is not something you wish to do.

aron_granberg · March 25, 2021, 11:02pm

Hi

The only thing I can think of is that for some reason you end up with some path requests that take a really long time to complete. When serializing the graph it will pause the pathfinding threads, and when doing that it will wait until all paths that are currently being calculated have been calculated (only those that the threads are working on, not all paths in the queue).
If you can add telemetry to your beta you can post events if some paths take a really long time to calculate. You can check this using the Path.duration field (https://arongranberg.com/astar/docs/path.html#duration).
This also seems consistent with what you say about it breaking after a while. Though it seems odd that a re-scan would fix it.
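
A minimal sketch of such a check, assuming the duration is reported in milliseconds. SlowPathLog, its threshold, and the message format are hypothetical; a real game would forward the event to its telemetry service instead of keeping it in a list:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical telemetry helper; names and threshold are illustrative only.
public sealed class SlowPathLog
{
    private readonly float thresholdMs;
    private readonly List<string> events = new List<string>();

    public SlowPathLog(float thresholdMs)
    {
        this.thresholdMs = thresholdMs;
    }

    // Recorded slow-path events, oldest first.
    public IReadOnlyList<string> Events => events;

    // Records an event when a path took at least thresholdMs to calculate.
    // Returns true when an event was recorded.
    public bool Report(float durationMs, bool failed)
    {
        if (durationMs < thresholdMs) return false;
        events.Add($"slow path: {durationMs} ms, failed={failed}");
        return true;
    }
}
```

From a path-complete callback this would be called as something like log.Report(p.duration, p.error), where p is the completed Path.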

MH6 · March 26, 2021, 2:10am

I was able to get someone’s save that exhibits the issue. I will put in checks for duration, though I thought there was a max timeout for paths already? If not I’ll add one.

This is what the game was waiting on when it froze during save. Perhaps it is a path calculation that is failing, since most of the time path calculations are done super quick. I’ll look into it further.

ThreadControlQueue, line 191. Inside the Pop function.

It also looks like the PathProcessor::Lock function itself gets stuck waiting for queue.AllReceiversBlocked to return true

So it basically never makes it past the active.PausePathfinding() call in AssertSafe. It never actually returns the graphLock. blockedReceivers is stuck at 2, whereas numReceivers is 16.

MH6 · March 28, 2021, 2:27am

So upon examining your code further, I realize the blocking is intentional, and the real issue was that not all the threads would actually block–so it’d never continue with the saving, because AllReceiversBlocked would keep returning false.

I believe this was because some path requests were still coming in to be calculated, which would unblock the thread, so I put a conditional in AstarPath.StartPath where I check if pathfinding is paused, and if so it just does a path.FailWithError.

This seems to have fixed the issue for me; I’ll know more when I put an update out to everyone. I’m hoping this will fix the issue with the navmesh breaking as well, because I think that’s related to requesting paths while the navmesh is being updated. I just wasn’t doing my original pause condition checks at a low enough level, I think, and paths were sometimes still being calculated.

aron_granberg · March 31, 2021, 9:10am

Hmm, that’s odd. The code calls queue.Block() which should cause all pathfinding threads to complete whatever they were working on right now and then mark themselves as blocked and wait for the queue to be unblocked. This should cause AllReceiversBlocked to become true regardless of if new pathfinding requests are coming in or not.

Are you requesting paths from other threads? Otherwise, I don’t see how AstarPath.StartPath could be called while it is waiting for the receivers to be blocked?

It’s great that you found a solution though

MH6 · April 6, 2021, 5:40am

No, I do not request paths from other threads, only the main thread. Also, I load/save graphs in coroutines, if that matters. Unfortunately my problems persist.

I do a ScanAsync when the player does not have a navmesh cached in their save file already (or the navmesh settings were updated between versions so we need to regenerate it), so that I can show them some visual feedback via a progress bar.

This would randomly fail at runtime, especially if I load into a game and then load again. It throws an exception in ScanInternal at the line if (!coroutine.MoveNext()) break; (I forget the exact exception).

To fix it, I call LoadFromCache to reset it to the default level navmesh prior to doing ScanAsync. This at least makes ScanAsync more reliable.

Problem is, this brings me right back to square one, where LoadFromCache just hangs for up to 8 minutes. I get no feedback as to what it is doing; pausing in Visual Studio just says it’s executing non-compatible code. I pause all pathfinding requests prior to loading/scanning and even wait a couple of seconds after.

I really hope these issues go away when we upgrade to 2020.3, because the threading seems to have a lot of issues currently. I’m about ready to just do away with in-game loading entirely and reload the entire unity scene files, even though that will drastically increase loading times. Everything else can be reloaded at runtime very quickly except the navmesh.

Is there no way to just force Astar to terminate literally everything so I can load a new graph and resume without having to wait over 8 minutes? I’m at my wit’s end here.

aron_granberg · April 6, 2021, 1:07pm

Hi

8 minutes is ridiculously long. Can you replicate this with a saved graph file? If so, would it be possible for you to share that graph file with me?

Probably you are trying to load the game and then re-load it while you are scanning the graph. That will cause issues. I would recommend that if that happens you should block until the scan is complete before trying to load other graphs. Just dropping the coroutine could cause issues because it might own things like native arrays and other things that need to be disposed of properly.

MH6 · April 6, 2021, 4:45pm

When I load again after loading in, the scan has already completed. I display a loading bar while ScanAsync is running; once that goes away the AI start pathfinding. Then I load again and it fails (or takes a very long time if I include the LoadFromCache call first).

After having slept on it, I did some more digging.

The LoadFromCache call was stalling in AssertSafe(), when calling active.FlushWorkItems().
It was processing quite a few (~300) GraphUpdateObjects.

During load, I delete all the existing paths/fences on the map, which subsequently (and incorrectly) triggered graph updates for their Bounds. I put in a conditional check to skip this now if it is during load (vs the player deleting a path/fence during gameplay), since there is no point in updating graphs I’m about to erase and reload. So there are no longer a bunch of work items in the queue, and LoadFromCache no longer takes a long time to execute. Thus far it seems to be loading graphs much faster overall than before. This also explains why the first load never seemed to exhibit the problem (there were no existing items to remove, and thus no navmesh updates).
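
The check is roughly of this shape. This is a standalone sketch with hypothetical names (GraphUpdateQueue, BeginLoad, RequestUpdate); a string label stands in for the Bounds, and in the game RequestUpdate would end up calling UpdateGraphs rather than filling a list:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of skipping graph updates while a save is being loaded.
public sealed class GraphUpdateQueue
{
    private readonly List<string> pending = new List<string>();
    private bool loading;

    public int PendingCount => pending.Count;

    // Wrap the load in a using block; updates requested inside it are discarded.
    public IDisposable BeginLoad()
    {
        loading = true;
        // The graphs are about to be replaced, so already queued updates are moot.
        pending.Clear();
        return new LoadScope(this);
    }

    // Called when a path/fence is added or removed.
    public void RequestUpdate(string area)
    {
        if (loading) return; // skip: the graph is about to be thrown away
        pending.Add(area);
    }

    private sealed class LoadScope : IDisposable
    {
        private readonly GraphUpdateQueue owner;

        public LoadScope(GraphUpdateQueue owner)
        {
            this.owner = owner;
        }

        // Leaving the using block re-enables updates.
        public void Dispose()
        {
            owner.loading = false;
        }
    }
}
```

Using a disposable scope rather than a bare flag means the flag is reset even if the load code throws partway through.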

So we can chalk this up to just me being stupid, but I’d still love a way to basically force AStar to stop everything it is doing and “start fresh” so to speak. There is no point in waiting on a bunch of navmesh updates when the graphs will be thrown out anyway immediately after. I understand if that’s not worth the effort it’d take to implement though.

Hopefully this is the last of my troubles.