Deployment Topology

This document covers the characteristics and requirements of a Concourse deployment, in a way that is agnostic to how it's deployed. As such we'll use the term "node" to mean anything from a managed VM, to a container, to an individual process (in the event that they may all be running on the same machine).

A typical Concourse deployment has a web node, a worker node, and a PostgreSQL database running somewhere, which we'll simply call db.

Note that this document will use the default ports when describing which nodes talk to which other nodes. If you've changed these defaults, keep that in mind.
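
For quick reference, the stock defaults used throughout this document can be collected in one place. This is just a sketch as shell variables; substitute your own values if you've reconfigured any of these ports:

```shell
# Default ports in a stock Concourse deployment (adjust if reconfigured).
TSA_PORT=2222          # web node: worker registration (TSA)
GARDEN_PORT=7777       # worker node: container management (Garden)
BAGGAGECLAIM_PORT=7788 # worker node: volume management (BaggageClaim)
GC_PORT=7799           # worker node: garbage collection
DB_PORT=5432           # db node: PostgreSQL's default port
```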

The web node

Components: ATC and TSA.

Disk usage: none

CPU usage: peaks during pipeline scheduling, primarily when scheduling jobs. This can be mitigated by adding more web nodes. At large scale, web nodes are best thought of as compute-heavy.

Bandwidth usage: aside from handling external traffic, the web node will at times have to stream bits out from one worker and into another while executing Steps.

Memory usage: not well characterized at the moment, as it's rarely a concern in practice. Give it a few gigabytes and keep an eye on it.

Highly available: yes; web nodes can all be configured the same (aside from --peer-url) and placed behind a load balancer. Periodic tasks like garbage-collection will not be duplicated for each node.

Horizontally scalable: yes; they will coordinate workloads using the database, resulting in less work for each node and thus lower CPU usage.

Outbound traffic:

  • db on its configured port for persistence

  • db on its configured port for locking and coordinating in a multi-web node deployment

  • directly-registered worker nodes on ports 7777, 7788, and 7799 for checking resources, executing builds, and performing garbage-collection

  • other web nodes (possibly itself) on an ephemeral port when a worker is forwarded through the web node's TSA
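
The outbound paths above can be smoke-tested from a web node with a small script. This is a sketch: check_port is a hypothetical helper, the hostnames are placeholders for your actual nodes, and bash's /dev/tcp feature is assumed to be available.

```shell
# check_port HOST PORT — succeeds if a TCP connection can be opened.
# Uses bash's /dev/tcp pseudo-device; capped at 2 seconds per attempt.
check_port() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

# Placeholders: replace with your real db and worker hostnames.
# check_port db.internal 5432        # persistence and locking
# check_port worker-1.internal 7777  # Garden
# check_port worker-1.internal 7788  # BaggageClaim
# check_port worker-1.internal 7799  # garbage collection
```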

Inbound traffic:

  • worker connects to the TSA on port 2222 for registration

  • worker downloads inputs from the ATC during fly execute via its external URL

  • external traffic to the ATC API via the web UI and fly

The worker node

Components: Garden and BaggageClaim.

Disk usage: arbitrary data will be written as builds run, and resource caches will be kept and garbage-collected on their own life cycle. We suggest a larger disk size if it's not too much trouble. No state on disk should outlive the worker itself; it is all ephemeral. If the worker is re-created (i.e. a fresh VM or container, with all processes killed), it should be brought back with an empty disk.
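
Because all worker state is ephemeral, a common pattern is to clear the work directory whenever a worker comes up fresh. A minimal sketch, assuming WORK_DIR points at wherever your worker keeps its state (the path below is a placeholder):

```shell
# Placeholder path; point this at your worker's actual work directory.
WORK_DIR="${WORK_DIR:-/opt/concourse/worker}"

# Wipe any leftover state before starting a re-created worker.
reset_work_dir() {
  rm -rf "$WORK_DIR"
  mkdir -p "$WORK_DIR"
}
```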

CPU usage: almost entirely subject to pipeline workloads. More resources configured will result in more checking, and in-flight builds will use as much CPU as they want.

Bandwidth usage: again, almost entirely subject to pipeline workloads. Expect spikes from periodic checking, though the check intervals should spread the load out over time. Resource fetching and pushing will also use arbitrary bandwidth.

Memory usage: also subject to pipeline workloads. Expect usage to increase with the number of containers on the worker and spike as builds run.

Highly available: not applicable. Workers are inherently singletons, as they're being used as drivers running entirely different workloads.

Horizontally scalable: yes; workers directly correlate to your capacity required by however many pipelines, resources, and in-flight builds you want to run. It makes sense to scale them up and down with demand.

Outbound traffic:

  • external traffic to arbitrary locations as a result of periodic resource checking and running builds

  • external traffic to the web node's configured external URL when downloading the inputs for a fly execute

  • external traffic to the web node's TSA port (2222) for registering the worker

Inbound traffic:

  • various connections from the web node on port 7777 (Garden), 7788 (BaggageClaim), and 7799 (garbage collection)

  • repeated connections to ports 7777 and 7788 from the web node's TSA component as it heartbeats to ensure the worker is healthy

The db node

Components: PostgreSQL

Disk usage: pipeline configurations and various bookkeeping metadata for keeping track of jobs, builds, resources, containers, and volumes. In addition, all build logs are stored in the database. This is the primary source of disk usage. To mitigate this, users can configure build_logs_to_retain on a job, but currently there is no operator control for this setting. As a result, disk usage on the database can grow arbitrarily large.
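
Since build logs are the main driver of database growth, capping retention per job is worth doing. A minimal sketch of the job-level setting; the job name, resource, and task file are placeholders:

```yaml
jobs:
- name: build                 # placeholder job name
  build_logs_to_retain: 100   # keep only the 100 most recent build logs
  plan:
  - get: my-repo              # placeholder resource
  - task: unit
    file: my-repo/ci/unit.yml
```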

CPU usage: this is one of the most volatile metrics, and one we try pretty hard to keep down. There will be near-constant database queries running, and while we try to keep them very simple, there is always more work to do. Expect to feed your database with at least a couple cores, ideally four to eight. Monitor this closely as the size of your deployment and the amount of traffic it's handling increases, and scale accordingly.

Bandwidth usage: it's a database, so expect steady query traffic. Not much stands out here, though build logs can result in an arbitrary amount of data being sent over the network to the database. This should still be nothing compared to worker bandwidth.

Memory usage: similar to CPU usage, but not quite as volatile.

Highly available: up to you. Running clustered PostgreSQL yourself can be tricky to deploy and operate, but most cloud providers offer managed highly-available PostgreSQL.

Horizontally scalable: not really; PostgreSQL has a single writable primary, so plan on scaling the db node vertically instead.

Outbound traffic:

  • none

Inbound traffic:

  • only ever from the web node, specifically the ATC