Finding Goroutine Leaks in Tests
A leaked goroutine at the end of a test can indicate several problems. Let's first take a look at the most common ones before tackling an approach to finding them.
First, we can have a goroutine that is blocked. As an example:
In this case, when the context is canceled, the goroutines might end up leaking.
Problem: Leaked Resource
Services, connections, and databases often run an internal goroutine for asynchronous processing. A leaked goroutine can reveal that such a resource was never closed.
Even if the main loop is handled properly, conn.handle(msg) could still deadlock in other ways.
Problem: Lazy Closing Order
Even if all the goroutines terminate, there can still be ordering problems with regard to resource usage. For example, a goroutine could depend on a database, connection, file, or any other resource that gets closed before the goroutine finishes.
Let's take a common case of the problem:
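The sketch below uses a hypothetical in-memory `DB` type in place of a real database and omits the Logger for brevity. The `delay` channel and the returned error channel exist only so that the example is deterministic and its outcome can be observed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// DB is a hypothetical stand-in for a real database handle.
type DB struct {
	mu     sync.Mutex
	closed bool
}

func (db *DB) Exec(query string) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if db.closed {
		return errors.New("database is closed")
	}
	return nil
}

func (db *DB) Close() {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.closed = true
}

type Server struct {
	DB *DB
}

// Handle returns immediately and updates statistics in the background.
// Nothing tracks the goroutine, so Close cannot wait for it.
func (server *Server) Handle(delay <-chan struct{}) <-chan error {
	result := make(chan error, 1)
	go func() {
		<-delay // simulates slow work before touching the database
		result <- server.DB.Exec("UPDATE stats SET ...")
	}()
	return result
}

func (server *Server) Close() { server.DB.Close() }

func main() {
	server := &Server{DB: &DB{}}
	delay := make(chan struct{})
	result := server.Handle(delay)

	server.Close() // the database is closed while the goroutine is still running
	close(delay)

	fmt.Println(<-result) // database is closed
}
```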
In this case, when the Server is closed, there still could be goroutines updating the database in the background. Similarly, even the Logger could be closed before the goroutine finishes, causing some other problems.
The severity of such close-ordering problems depends on the context. Sometimes it's a simple extra error in the log; in other cases, it can be a data race or a panic that takes the whole process down.
Rule of Thumb
Hopefully, it's clear that such goroutines can be problematic.
One of the best rules in terms of preventing these issues is:
The location that starts the goroutine must wait for the goroutine to complete even in the presence of context cancellation. Or, it must explicitly transfer that responsibility to some other service.
As long as you close the top-level service responsible for everything, leaks become visible in tests: if a goroutine leaks, the close blocks and the test cannot finish.
Unfortunately, this rule cannot be applied to third-party libraries, and it's easy to forget to add tracking to a goroutine.
We could compare the total number of goroutines before and after a test to find leaks; however, that wouldn't work with parallel tests.
One helpful feature in Go is goroutine labels (set through the runtime/pprof package), which can make profiling and stack traces more readable. An interesting property of labels is that they are propagated automatically to child goroutines.
This means that if we attach a unique label to a goroutine, we should be able to find all of its child goroutines. However, the code for finding such goroutines is not trivial.
To attach the label:
Unfortunately, there is currently no easy way to get the goroutines with a given label. However, we can use the profiling endpoints to extract the necessary information, although this is clearly not very efficient.
And a failing test might look like this:
The full example can be found at https://go.dev/play/p/KTF9tyEmLor.
Depending on your use case, you may want to adjust this approach to your needs. For example, you may want to skip some goroutines, print some extra information, or allow a grace period for transient goroutines to shut down.
Such an approach can be hooked into your tests or existing system in a multitude of ways.