Agents Overview

This topic provides an overview of the Consul agent, which is the core process of Consul. The agent maintains membership information, registers services, runs checks, responds to queries, and more. The agent must run on every node that is part of a Consul cluster.

Agents run in either client or server mode. Client nodes are lightweight processes that make up the majority of the cluster. They interface with the server nodes for most operations and maintain very little state of their own. Clients run on every node where services are running.

In addition to the core agent operations, server nodes participate in the consensus quorum. The quorum is based on the Raft protocol, which provides strong consistency and availability in the case of failure. Server nodes should run on dedicated instances because they are more resource intensive than client nodes.

Lifecycle

Every agent in the Consul cluster goes through a lifecycle. Understanding the lifecycle is useful for building a mental model of an agent's interactions with a cluster and how the cluster treats a node. The following process describes the agent lifecycle within the context of an existing cluster:

An agent is started either manually or through an automated or programmatic process. Newly-started agents are unaware of other nodes in the cluster.
An agent joins a cluster, which enables the agent to discover agent peers. Agents join clusters on startup when the join command is issued or according the auto-join configuration.
Information about the agent is gossiped to the entire cluster. As a result, all nodes will eventually become aware of each other.
Existing servers will begin replicating to the new node if the agent is a server.

Failures and crashes

In the event of a network failure, some nodes may be unable to reach other nodes. Unreachable nodes will be marked as failed.

Distinguishing between a network failure and an agent crash is impossible. As a result, agent crashes are handled in the same manner is network failures.

Once a node is marked as failed, this information is updated in the service catalog.

Note: Updating the catalog is only possible if the servers can still form a quorum. Once the network recovers or a crashed agent restarts, the cluster will repair itself and unmark a node as failed. The health check in the catalog will also be updated to reflect the current state.

Exiting nodes

When a node leaves a cluster, it communicates its intent and the cluster marks the node as having left. In contrast to changes related to failures, all of the services provided by a node are immediately deregistered. If a server agent leaves, replication to the exiting server will stop.

To prevent an accumulation of dead nodes (nodes in either failed or left states), Consul will automatically remove dead nodes out of the catalog. This process is called reaping. This is currently done on a configurable interval of 72 hours (changing the reap interval is not recommended due to its consequences during outage situations). Reaping is similar to leaving, causing all associated services to be deregistered.

Limit traffic rates

You can define a set of rate limiting configurations that help operators protect Consul servers from excessive or peak usage. The configurations enable you to gracefully degrade Consul servers to avoid a global interruption of service. Consul supports global server rate limiting, which lets configure Consul servers to deny requests that exceed the read or write limits. Refer to Traffic Rate Limits Overview.

Requirements

You should run one Consul agent per server or host. Instances of Consul can run in separate VMs or as separate containers. At least one server agent per Consul deployment is required, but three to five server agents are recommended. Refer to the following sections for information about host, port, memory, and other requirements:

The Datacenter Deploy tutorial contains additional information, including licensing configuration, environment variables, and other details.

Maximum latency network requirements

Consul uses the gossip protocol to share information across agents. To function properly, you cannot exceed the protocol's maximum latency threshold. The latency threshold is calculated according to the total round trip time (RTT) for communication between all agents. Other network usages outside of Gossip are not bound by these latency requirements (i.e. client to server RPCs, HTTP API requests, xDS proxy configuration, DNS).

For data sent between all Consul agents the following latency requirements must be met:

Average RTT for all traffic cannot exceed 50ms.
RTT for 99 percent of traffic cannot exceed 100ms.

Starting the Consul agent

Start a Consul agent with the consul command and agent subcommand using the following syntax:

$ consul agent <options>

Consul ships with a -dev flag that configures the agent to run in server mode and several additional settings that enable you to quickly get started with Consul. The -dev flag is provided for learning purposes only. We strongly advise against using it for production environments.

Getting Started Tutorials: You can test a local agent in a VM by following the Get Started tutorials.

When starting Consul with the -dev flag, the only additional information Consul needs to run is the location of a directory for storing agent state data. You can specify the location with the -data-dir flag or define the location in an external file and point the file with the -config-file flag.

You can also point to a directory containing several configuration files with the -config-dir flag. This enables you to logically group configuration settings into separate files. See Configuring Consul Agents for additional information.

The following example starts an agent in dev mode and stores agent state data in the tmp/consul directory:

$ consul agent -data-dir=tmp/consul -dev

Agents are highly configurable, which enables you to deploy Consul to any infrastructure. Many of the default options for the agent command are suitable for becoming familiar with a local instance of Consul. In practice, however, several additional configuration options must be specified for Consul to function as expected. Refer to Agent Configuration topic for a complete list of configuration options.

Understanding the agent startup output

Consul prints several important messages on startup. The following example shows output from the consul agent command:

$ consul agent -data-dir=/tmp/consul
==> Starting Consul agent...
==> Consul agent running!
       Node name: 'Armons-MacBook-Air'
      Datacenter: 'dc1'
          Server: false (bootstrap: false)
     Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600)
    Cluster Addr: 192.168.1.43 (LAN: 8301, WAN: 8302)

==> Log data will now stream in as it occurs:

    [INFO] serf: EventMemberJoin: Armons-MacBook-Air.local 192.168.1.43
...

Node name: This is a unique name for the agent. By default, this is the hostname of the machine, but you may customize it using the -node flag.
Datacenter: This is the datacenter in which the agent is configured to run. For single-DC configurations, the agent will default to dc1, but you can configure which datacenter the agent reports to with the -datacenter flag. Consul has first-class support for multiple datacenters, but configuring each node to report its datacenter improves agent efficiency.
Server: This indicates whether the agent is running in server or client mode. Running an agent in server mode requires additional overhead. This is because they participate in the consensus quorum, store cluster state, and handle queries. A server may also be in "bootstrap" mode, which enables the server to elect itself as the Raft leader. Multiple servers cannot be in bootstrap mode because it would put the cluster in an inconsistent state.
Client Addr: This is the address used for client interfaces to the agent. This includes the ports for the HTTP and DNS interfaces. By default, this binds only to localhost. If you change this address or port, you'll have to specify a -http-addr whenever you run commands such as consul members to indicate how to reach the agent. Other applications can also use the HTTP address and port to control Consul.
Cluster Addr: This is the address and set of ports used for communication between Consul agents in a cluster. Not all Consul agents in a cluster have to use the same port, but this address MUST be reachable by all other nodes.

When running under systemd on Linux, Consul notifies systemd by sending READY=1 to the $NOTIFY_SOCKET when a LAN join has completed. For this either the join or retry_join option has to be set and the service definition file has to have Type=notify set.

Configuring Consul agents

You can specify many options to configure how Consul operates when issuing the consul agent command. You can also create one or more configuration files and provide them to Consul at startup using either the -config-file or -config-dir option. Configuration files must be written in either JSON or HCL format.

Consul Terminology: Configuration files are sometimes called "service definition" files when they are used to configure client agents. This is because clients are most commonly used to register services in the Consul catalog.

The following example starts a Consul agent that takes configuration settings from a file called server.json located in the current working directory:

$ consul agent -config-file=server.json

The configuration options necessary to successfully use Consul depend on several factors, including the type of agent you are configuring (client or server), the type of environment you are deploying to (e.g., on-premise, multi-cloud, etc.), and the security options you want to implement (ACLs, gRPC encryption). The following examples are intended to help you understand some of the combinations you can implement to configure Consul.

Common configuration settings

The following settings are commonly used in the configuration file (also called a service definition file when registering services with Consul) to configure Consul agents:

Parameter	Description	Default
`node_name`	String value that specifies a name for the agent node. See `-node-id` for details.	Hostname of the machine
`server`	Boolean value that determines if the agent runs in server mode. See `-server` for details.	`false`
`datacenter`	String value that specifies which datacenter the agent runs in. See -datacenter for details.	`dc1`
`data_dir`	String value that specifies a directory for storing agent state data. See `-data-dir` for details.	none
`log_level`	String value that specifies the level of logging the agent reports. See `-log-level` for details.	`info`
`retry_join`	Array of string values that specify one or more agent addresses to join after startup. The agent will continue trying to join the specified agents until it has successfully joined another member. See `-retry-join` for details.	none
`addresses`	Block of nested objects that define addresses bound to the agent for internal cluster communication.	`"http": "0.0.0.0"` See the Agent Configuration page for default address values
`ports`	Block of nested objects that define ports bound to agent addresses. See (link to addresses option) for details.	See the Agent Configuration page for default port values

Server node in a service mesh

The following example configuration is for a server agent named "consul-server". The server is bootstrapped and the Consul GUI is enabled. The reason this server agent is configured for a service mesh is that the connect configuration is enabled. The connect subsystem provides Consul's service mesh capabilities, including service-to-service connection authorization and encryption using mutual Transport Layer Security (TLS). Applications can use sidecar proxies in a service mesh configuration to establish TLS connections for inbound and outbound connections without being aware of Consul service mesh at all. Refer to Consul Service Mesh for details.

node_name = "consul-server"
server    = true
bootstrap = true
ui_config {
  enabled = true
}
datacenter = "dc1"
data_dir   = "consul/data"
log_level  = "INFO"
addresses {
  http = "0.0.0.0"
}
connect {
  enabled = true
}

Server node with encryption enabled

The following example shows a server node configured with encryption enabled. Refer to the Security chapter for additional information about how to configure security options for Consul.

node_name = "consul-server"
server = true
ui_config {
  enabled = true
}
data_dir = "consul/data"
addresses {
  http = "0.0.0.0"
}
retry_join = [
  "consul-server2",
  "consul-server3"
]
encrypt = "aPuGh+5UDskRAbkLaXRzFoSOcSM+5vAK+NEYOWHJH7w="

tls {
  defaults {
    verify_incoming = true
    verify_outgoing = true
    ca_file = "/consul/config/certs/consul-agent-ca.pem"
    cert_file = "/consul/config/certs/dc1-server-consul-0.pem"
    key_file = "/consul/config/certs/dc1-server-consul-0-key.pem"
    verify_server_hostname = true
  }
}

Client node registering a service

Using Consul as a central service registry is a common use case. The following example configuration includes common settings to register a service with a Consul agent and enable health checks. Refer to Define Health Checks to learn more about health checks.

node_name  = "consul-client"
server     = false
datacenter = "dc1"
data_dir   = "consul/data"
log_level  = "INFO"
retry_join = ["consul-server"]
service {
  id      = "dns"
  name    = "dns"
  tags    = ["primary"]
  address = "localhost"
  port    = 8600
  check {
    id       = "dns"
    name     = "Consul DNS TCP on port 8600"
    tcp      = "localhost:8600"
    interval = "10s"
    timeout  = "1s"
  }
}

Client node with multiple interfaces or IP addresses

The following example shows how to configure Consul to listen on multiple interfaces or IP addresses using a go-sockaddr template.

The bind_addr is used for internal RPC and Serf communication (read the Agent Configuration for more information).

The client_addr configuration specifies IP addresses used for HTTP, HTTPS, DNS and gRPC servers. (read the Agent Configuration for more information).

node_name = "consul-server"
server    = true
bootstrap = true
ui_config {
  enabled = true
}
datacenter = "dc1"
data_dir   = "consul/data"
log_level  = "INFO"

# used for internal RPC and Serf
bind_addr = "0.0.0.0"

# Used for HTTP, HTTPS, DNS, and gRPC addresses.
# loopback is not included in GetPrivateInterfaces because it is not routable.
client_addr = "{{ GetPrivateInterfaces | exclude \"type\" \"ipv6\" | join \"address\" \" \" }} {{ GetAllInterfaces | include \"flags\" \"loopback\" | join \"address\" \" \" }}"

# advertises gossip and RPC interface to other nodes
advertise_addr = "{{ GetInterfaceIP \"en0\" }}"

Stopping an agent

An agent can be stopped in two ways: gracefully or forcefully. Servers and Clients both behave differently depending on the leave that is performed. There are two potential states a process can be in after a system signal is sent: left and failed.

To gracefully halt an agent, send the process an interrupt signal (usually Ctrl-C from a terminal, or running kill -INT consul_pid ). For more information on different signals sent by the kill command, see here

When a Client is gracefully exited, the agent first notifies the cluster it intends to leave the cluster. This way, other cluster members notify the cluster that the node has left.

When a Server is gracefully exited, the server will not be marked as left. This is to minimally impact the consensus quorum. Instead, the Server will be marked as failed. To remove a server from the cluster, the force-leave command is used. Using force-leave will put the server instance in a left state so long as the Server agent is not alive.

Alternatively, you can forcibly stop an agent by sending it a kill -KILL consul_pid signal. This will stop any agent immediately. The rest of the cluster will eventually (usually within seconds) detect that the node has died and notify the cluster that the node has failed.

For client agents, the difference between a node failing and a node leaving may not be important for your use case. For example, for a web server and load balancer setup, both result in the same outcome: the web node is removed from the load balancer pool.

The skip_leave_on_interrupt and leave_on_terminate configuration options allow you to adjust this behavior.