Concourse runs everything in isolated environments by creating fresh containers and volumes to ensure things can safely run in a repeatable environment, isolated from other workloads running on the same worker.
This introduces a new problem of knowing when Concourse should remove these containers and volumes. Safely identifying things for removal and then getting rid of them, releasing their resources, is the process of garbage collection.
Let's define our metrics for success:
Safe. There should never be a case where a build is running and a container or volume is removed out from under it, causing the build to fail. Resource checking should also never result in errors from check containers being removed. No one should even know garbage collection is happening.
Airtight. Everything Concourse creates, whether it's a container or volume on a worker or an entry in the database, should never leak. Each object should have a fully defined lifecycle such that there is a clear end to its use. The ATC should be interruptible at any point in time and at the very least be able to remove any state it had created beforehand.
Resilient. Garbage collection should never be outpaced by the workload. A single misbehaving worker should not prevent garbage collection from being performed on other workers. A slow delete of a volume should not prevent garbage collecting of other things on the same worker.
The garbage collector is a batch operation that runs on an interval with a default of 30 seconds. It's important to note that the collector must be able to run frequently enough to not be outpaced by the workload producing things, and so the batch operation should be able to complete pretty quickly.
The batch operation first performs garbage collection within the database alone, removing rows that are no longer needed. The removal of rows from one stage will often result in removals in a later stage. There are individual collectors for each object, such as the volume collector or the container collector, and they are all run asynchronously.
After the initial pass of garbage collection in the database, there should now be a set of containers and volumes that meet criteria for garbage collection. These two are a bit more complicated to garbage-collect; they both require talking to a worker, and waiting on a potentially slow delete.
Containers and volumes are the costliest resources consumed by Concourse. There are also many of them created over time as builds execute and pipelines perform their resource checking. Therefore it is important to parallelize this aspect of garbage collection so that one slow delete or one slow worker does not cause them to pile up.