This document covers the characteristics and requirements of a Concourse deployment in a way that is agnostic to how it's deployed. As such, we'll use the term "node" to mean anything from a managed VM, to a container, to an individual process (in the event that they're all running on the same machine).
A typical Concourse deployment has a web node, a worker node, and a PostgreSQL database running somewhere, which we'll call db for the sake of completeness.
Note that this document will use the default ports when describing which nodes talk to which other nodes. If you've changed these defaults, keep that in mind.
Disk usage: none
CPU usage: peaks during pipeline scheduling, primarily when scheduling jobs. Mitigated by adding more web nodes. In this regard, web nodes can be considered compute-heavy more than anything else at large scale.
Bandwidth usage: aside from handling external traffic, the web node will at times have to stream bits out from one worker and into another while executing Steps.
Memory usage: not well characterized at the moment, as it's not generally a concern. Give it a few gigabytes and keep an eye on it.
Highly available: yes; web nodes can all be configured the same (aside from --peer-url) and placed behind a load balancer. Periodic tasks like garbage collection will not be duplicated for each node.
Horizontally scalable: yes; they will coordinate workloads using the database, resulting in less work for each node and thus lower CPU usage.
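As a concrete sketch of the high-availability setup above, two identically configured web nodes might be launched like this. All hostnames, keys, and credentials below are placeholders, and exact flag names can vary between Concourse versions, so check `concourse web --help` for yours:

```shell
# Hypothetical two-node web tier behind a load balancer at https://ci.example.com.
# Everything is identical except --peer-url, which must resolve to this
# specific node so that its peers (and forwarded workers) can reach it.

# On web-1:
concourse web \
  --external-url https://ci.example.com \
  --postgres-host db.internal \
  --postgres-user concourse \
  --postgres-password secret \
  --session-signing-key session_signing_key \
  --tsa-host-key tsa_host_key \
  --tsa-authorized-keys authorized_worker_keys \
  --peer-url http://web-1.internal:8080

# On web-2: the same flags, but --peer-url http://web-2.internal:8080
```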
The web node connects to the following:

- the db node on its configured port (5432 by default) for persistence
- the db node on its configured port for locking and coordinating in a multi-web-node deployment
- the worker nodes on ports 7777 (Garden) and 7788 (Baggageclaim) for checking resources, executing builds, and performing garbage-collection
- other web nodes (possibly itself) on an ephemeral port when a worker is forwarded through the web node's TSA
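To sanity-check these paths from a web node, a quick reachability test can help. The hostnames below are placeholders, and the ports assume the defaults (5432 for PostgreSQL; 7777 and 7788 for a worker's Garden and Baggageclaim APIs):

```shell
# Quick connectivity checks from a web node; replace the hostnames with
# your actual db and worker addresses.
nc -z db.internal 5432       && echo "db reachable"
nc -z worker-1.internal 7777 && echo "worker Garden API reachable"
nc -z worker-1.internal 7788 && echo "worker Baggageclaim API reachable"
```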
Disk usage: arbitrary data will be written as builds run, and resource caches will be kept and garbage collected on their own life cycle. We suggest going for a larger disk size if it's not too much trouble. All state on disk must not outlive the worker itself; it is all ephemeral. If the worker is re-created (i.e. fresh VM/container and all processes were killed), it should be brought back with an empty disk.
CPU usage: almost entirely subject to pipeline workloads. More resources configured will result in more checking, and in-flight builds will use as much CPU as they want.
Bandwidth usage: again, almost entirely subject to pipeline workloads. Expect spikes from periodic checking, though the intervals should spread out over enough time. Resource fetching and pushing will also use arbitrary bandwidth.
Memory usage: also subject to pipeline workloads. Expect usage to increase with the number of containers on the worker and spike as builds run.
Highly available: not applicable. Workers are inherently singletons, as they're being used as drivers running entirely different workloads.
Horizontally scalable: yes; worker count correlates directly with the capacity required by however many pipelines, resources, and in-flight builds you want to run. It makes sense to scale them up and down with demand.
The worker node connects to the following:

- external traffic to arbitrary locations as a result of periodic resource checking and running builds
- external traffic to the web node's configured external URL when downloading the inputs for a one-off build (fly execute)
- external traffic to the web node's TSA port (2222) for registering the worker
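A worker launch might look roughly like the following. The hostnames and key paths are placeholders, and exact flag names can vary slightly between Concourse versions (check `concourse worker --help`). The work dir is wiped first to honor the ephemeral-disk requirement described above:

```shell
# Hypothetical worker startup script. The work dir is cleared first so a
# re-created worker always comes back with an empty disk.
rm -rf /opt/concourse/worker && mkdir -p /opt/concourse/worker

concourse worker \
  --work-dir /opt/concourse/worker \
  --tsa-host ci.example.com:2222 \
  --tsa-public-key tsa_host_key.pub \
  --tsa-worker-private-key worker_key
```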
Disk usage: pipeline configurations and various bookkeeping metadata for keeping track of jobs, builds, resources, containers, and volumes. In addition, all build logs are stored in the database. This is the primary source of disk usage. To mitigate this, users can configure build_logs_to_retain on a job, but currently there is no operator control for this setting. As a result, disk usage on the database can grow arbitrarily large.
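As a sketch, per-job log retention is configured in the pipeline itself; the job, resource, and task names below are made up:

```yaml
jobs:
- name: unit                  # hypothetical job name
  build_logs_to_retain: 100   # keep logs for only the most recent 100 builds
  plan:
  - get: repo
  - task: test
    file: repo/ci/test.yml
```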
CPU usage: this is one of the most volatile metrics, and one we try pretty hard to keep down. There will be near-constant database queries running, and while we try to keep them very simple, there is always more work to do. Expect to feed your database with at least a couple cores, ideally four to eight. Monitor this closely as the size of your deployment and the amount of traffic it's handling increases, and scale accordingly.
Bandwidth usage: it's a database, so it most definitely uses the network. Not much should stand out here, though build logs can result in an arbitrary amount of data being sent over the network to the database. This should be nothing compared to worker bandwidth, though.
Memory usage: similar to CPU usage, but not quite as volatile.
Highly available: up to you. Clustered PostgreSQL is kind of new and probably tricky to deploy, but there are various cloud solutions for this.
Horizontally scalable: not really; PostgreSQL doesn't scale writes horizontally, so plan on scaling the db node vertically instead.
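Since build logs are the main driver of disk growth, it can be worth checking database and table sizes periodically. A sketch, assuming psql reaches the Concourse database via the usual PG* environment variables, and that build logs live in the build_events table:

```shell
# Overall size of the Concourse database.
psql -c "SELECT pg_size_pretty(pg_database_size(current_database()));"

# Size of the build_events table, where build logs are stored; this is
# typically the largest relation in the database.
psql -c "SELECT pg_size_pretty(pg_total_relation_size('build_events'));"
```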
The db node receives traffic only ever from the web node (specifically, the ATC).