Deployment Topology
This document covers the characteristics and requirements of a Concourse deployment in a way that is agnostic to how it's deployed. As such, we'll use the term "node" to mean anything from a managed VM, to a container, to an individual process (in the event that they're all running on the same machine).
A typical Concourse deployment has a web node, a worker node, and a PostgreSQL database running somewhere, which we'll call the db node for the sake of completeness.
Note that this document will use the default ports when describing which nodes talk to which other nodes. If you've changed these defaults, keep that in mind.
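To make the topology concrete, here's a minimal sketch of bringing up one of each node with the concourse binary. The hostnames and paths are placeholders, keys and several required flags are omitted, and exact flag names vary between versions, so check concourse web --help and concourse worker --help for your release.

```
# db node: any reachable PostgreSQL instance (default port 5432)

# web node: runs the ATC (UI/API) and the TSA (worker registration, port 2222)
concourse web \
  --postgres-host db.example.com \
  --external-url https://ci.example.com

# worker node: registers with the web node's TSA
concourse worker \
  --work-dir /opt/concourse/worker \
  --tsa-host ci.example.com:2222
```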
The web node
Disk usage: none
CPU usage: peaks during pipeline scheduling, primarily when scheduling jobs. Mitigated by adding more web nodes. In this regard, web nodes can be considered compute-heavy more than anything else at large scale.
Bandwidth usage: aside from handling external traffic, the web node will at times have to stream bits out from one worker and into another while executing Steps.
Memory usage: not very well classified at the moment as it's not generally a concern. Give it a few gigabytes and keep an eye on it.
Highly available: yes; web nodes can all be configured the same (aside from --peer-url) and placed behind a load balancer, as sketched below. Periodic tasks like garbage-collection will not be duplicated for each node.
Horizontally scalable: yes; they will coordinate workloads using the database, resulting in less work for each node and thus lower CPU usage.
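A rough sketch of that highly-available arrangement, with placeholder addresses: every web node is launched with identical flags except --peer-url, which must be an address the other web nodes can use to reach it.

```
# web node 1; remaining flags (database, keys, etc.) identical on every node
concourse web --peer-url http://10.0.0.1:8080

# web node 2
concourse web --peer-url http://10.0.0.2:8080

# a load balancer in front of both nodes then receives all external traffic
```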
Outbound traffic:
db on its configured port for persistence
db on its configured port for locking and coordinating in a multi-web node deployment
directly-registered worker nodes on ports 7777, 7788, and 7799 for checking resources, executing builds, and performing garbage-collection
other web nodes (possibly itself) on an ephemeral port when a worker is forwarded through the web node's TSA
Inbound traffic:
worker connects to the TSA on port 2222 for registration
worker downloads inputs from the ATC during fly execute via its external URL
external traffic to the ATC API via the web UI and fly
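Expressed as firewall rules, the web node's traffic boils down to something like the following. iptables is used purely as an illustration, 8080 is the default ATC port, and you should substitute any non-default ports you've configured.

```
# inbound: ATC web UI/API and TSA worker registration
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT   # web UI, API, fly
iptables -A INPUT -p tcp --dport 2222 -j ACCEPT   # TSA registration

# outbound: the database, plus directly-registered workers
iptables -A OUTPUT -p tcp --dport 5432 -j ACCEPT  # PostgreSQL
iptables -A OUTPUT -p tcp -m multiport --dports 7777,7788,7799 -j ACCEPT  # checks, builds, GC
```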
The worker node
Components: Garden and BaggageClaim.
Disk usage: arbitrary data will be written as builds run, and resource caches will be kept and garbage collected on their own life cycle. We suggest going for a larger disk size if it's not too much trouble. All state on disk must not outlive the worker itself; it is all ephemeral. If the worker is re-created (i.e. fresh VM/container and all processes were killed), it should be brought back with an empty disk.
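One way to satisfy that requirement is to put the worker's work directory on a dedicated volume that is formatted fresh whenever the node is created, so a rebuilt worker always comes back empty. A sketch, with placeholder device and mount paths:

```
# on first boot of a freshly-created worker VM
mkfs.ext4 /dev/sdb                      # format the dedicated (ephemeral) volume
mkdir -p /opt/concourse/worker
mount /dev/sdb /opt/concourse/worker    # all worker state lives under this mount

concourse worker --work-dir /opt/concourse/worker --tsa-host ci.example.com:2222
```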
CPU usage: almost entirely subject to pipeline workloads. More resources configured will result in more checking, and in-flight builds will use as much CPU as they want.
Bandwidth usage: again, almost entirely subject to pipeline workloads. Expect spikes from periodic checking, though the intervals should spread out over enough time. Resource fetching and pushing will also use arbitrary bandwidth.
Memory usage: also subject to pipeline workloads. Expect usage to increase with the number of containers on the worker and spike as builds run.
Highly available: not applicable. Workers are inherently singletons, as they're being used as drivers running entirely different workloads.
Horizontally scalable: yes; workers directly correlate to the capacity required by however many pipelines, resources, and in-flight builds you want to run. It makes sense to scale them up and down with demand.
Outbound traffic:
external traffic to arbitrary locations as a result of periodic resource checking and running builds
external traffic to the web node's configured external URL when downloading the inputs for a fly execute
external traffic to the web node's TSA port (2222) for registering the worker
Inbound traffic:
none, if the worker registers through the web node's TSA, which forwards connections to it; directly-registered workers instead receive connections from web nodes on ports 7777, 7788, and 7799
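For a directly-registered worker, those lists translate into rules along these lines (iptables again purely as an illustration; 10.0.0.1 is a placeholder web node address):

```
# inbound: only web nodes may reach the worker's ports
iptables -A INPUT -p tcp -s 10.0.0.1 -m multiport --dports 7777,7788,7799 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 7777,7788,7799 -j DROP

# outbound: TSA registration; resource checks and fetches go wherever the
# pipelines point, so outbound traffic is hard to constrain in general
iptables -A OUTPUT -p tcp --dport 2222 -j ACCEPT
```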
The db node
Components: PostgreSQL
Disk usage: pipeline configurations and various bookkeeping metadata for keeping track of jobs, builds, resources, containers, and volumes. In addition, all build logs are stored in the database. This is the primary source of disk usage. To mitigate this, users can configure build_logs_to_retain on a job (see the example below), but currently there is no operator control for this setting. As a result, disk usage on the database can grow arbitrarily large.
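For instance, a hypothetical pipeline that caps each job at its 100 most recent build logs might look like this; note that the setting is per-job, so it only helps if pipeline authors actually use it.

```
cat > pipeline.yml <<'EOF'
jobs:
- name: build
  build_logs_to_retain: 100   # logs of older builds become eligible for reaping
  plan:
  - task: hello
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run: {path: echo, args: ["hello"]}
EOF

fly -t my-target set-pipeline -p my-pipeline -c pipeline.yml
```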
CPU usage: this is one of the most volatile metrics, and one we try pretty hard to keep down. There will be near-constant database queries running, and while we try to keep them very simple, there is always more work to do. Expect to feed your database with at least a couple cores, ideally four to eight. Monitor this closely as the size of your deployment and the amount of traffic it's handling increases, and scale accordingly.
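A couple of quick psql checks can feed that monitoring; the host, user, and database name below are placeholders for whatever your deployment uses.

```
psql -h db.example.com -U concourse atc <<'SQL'
SELECT pg_size_pretty(pg_database_size('atc'));               -- total size on disk
SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; -- queries running now
SQL
```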
Bandwidth usage: it's a database, so expect steady network traffic. Not much stands out here, though build logs can result in an arbitrary amount of data being sent over the network to the database. This should still be nothing compared to worker bandwidth.
Memory usage: similar to CPU usage, but not quite as volatile.
Highly available: up to you. Clustering PostgreSQL yourself can be tricky to deploy, but various cloud providers offer managed highly-available PostgreSQL.
Horizontally scalable: no; PostgreSQL doesn't horizontally scale a single writable database, so plan on scaling this node vertically instead.
Outbound traffic:
none
Inbound traffic:
only ever from the web node, specifically the ATC