July 2, 2019
This document presents the inner workings of Chorus validators currently operational on the Cosmos Hub, Loom’s PlasmaChain and the Terra network. Our team has spent >18 months thinking about, designing, deploying, maintaining and improving on our technical architecture.
Section 3 describes the high-level objectives that guided the design of our validator. This is followed by a description of the main components and their roles in the infrastructure. Focus then shifts to architecture diagrams and detailed descriptions of the components, after which we cover the security features of our IT estate. The penultimate section concerns itself with future improvements, followed by a conclusion summarizing the key ideas.
This document is a description of the current state. The design will keep improving, which will necessitate updates to this document.
At the inception of Chorus in early 2018, we chose a set of guiding principles that determined the points on which our technical energy would be focused. In the following, we cover all of the key principles with an explanation of why they were chosen:
We believe there is an enormous risk when critical operations are combined with financial pressure. At Chorus, we’ve instituted a design that minimizes our team members’ need to conduct critical troubleshooting operations under financial duress. Our troubleshooting operations are usually done without any financial value on the line.
Later in the article, we will demonstrate how these principles have shaped and strengthened our infrastructure. The key takeaway is that Chorus had a unique problem statement: building highly reliable, geographically dispersed infrastructure operated by a distributed team.
There are seven major components to our validation architecture: validation servers, key infrastructure, signing coordinator, sentry layer, central services, logging & monitoring, and analytics. Figure 1 shows the seven components diagrammatically, in the numerical order in which they will be described. We give an overview of each component, the infrastructure & tools used to deploy it, and the major design decisions taken pertaining to it. Section 5 will detail how the components and their subcomponents are connected with each other. Access control mechanisms placed at the various connections are detailed in Section 6.
Figure 1. The main components and the order in which they will be covered
Validation servers are the machines that decide which blocks get signed and broadcast to the network. These machines run the core clients of all the networks we validate on, run the Tendermint KMS, maintain memory pools of transactions for these networks, propose blocks and verify the veracity of accounting transactions. Security and high uptime of these machines are critical to the validator operation.
Validation servers should not hold the validator's signing keys themselves, as compromise of the servers could lead to theft of key material. We use Hardware Security Modules (HSMs), specialised devices that prevent key theft, to store key material.
Major decisions we faced regarding validation servers were: How many validation servers to use, where to host them and what software to run on them. Security policies pertaining to these servers will be covered in Section 6.
First of all, let's visit the question of where to host these machines. There are two main options: rely on a cloud provider such as AWS to provision these machines for Chorus, or purchase servers, purpose them as validation servers, and host them in Tier IV data centers with a 100% network uptime guarantee.
The main argument for validation servers in the cloud is the scalability and elasticity of capacity, plus the business operating freedom gained from a leasing model. This is counterbalanced by some significant short-term downsides. The first is that a majority of cloud machine offerings are at risk from speculative execution attacks, some previously discovered and some as yet undiscovered. These attacks could potentially allow attackers to access sensitive infrastructure secrets if the server deployments are not set up correctly. Second, cloud machine offerings generally do not provide the ability to insert custom Hardware Security Modules into the machines. This would mean that validation keys would need to be stored and accessed on a cloud HSM service like Amazon CloudHSM. To date, neither Amazon's nor Google's cloud HSM offerings allow us to leverage the Ed25519 keys required for the Cosmos Hub. They are also roughly 12x more expensive per year than a YubiHSM2, and would have required us to develop custom integrations with network clients. Third, many cloud offerings that are (or can be made) resistant to speculative execution attacks still don't allow companies such as ours to have full control over the software stack running on the machine. Therefore, we decided against putting our validation servers on clouds such as AWS and Google Cloud.
The above is no permanent indictment against validation servers running in the cloud. We believe that secure setups can be built with validation servers running in the cloud. Chorus will upgrade to such an approach in the future to make better use of both cloud key management and cloud machines. A later section outlines this possibility.
Once the cloud was eliminated for validation servers, we were left to purchase servers and select Tier IV data centers to host them. We onboarded two validation servers, one in North America and the other in Europe, managed in data centers operated by two different service providers. As explained in a later section, an active/active architecture has both of these validation servers work concurrently on blocks received from the networks.
Server placement across both sides of the Atlantic has the advantage of complete power and network connectivity isolation for the two servers. Even if the entire electrical grid of North America were to go down, the Chorus validator would keep signing blocks from Europe. This architecture also makes us less dependent on any individual data center vendor (and national regulatory framework) and changes in their policies. On the other hand, the downside introduced by this choice is that the Chorus validator incurs internal signing latency (~10-80 ms) due to the communication overhead required to coordinate these two servers across the Atlantic. This downside became apparent in Game of Stakes, an adversarial test network. Because of its adversarial nature, network message broadcast was not very consistent, and Chorus had a higher rate of missed blocks in normal operation compared to validators centered in a single geographical location. This downside is immaterial on the main networks, and has been mitigated by upgrades to our signing/coordination infrastructure. If it were to assume significance, we could always change the locations of the servers to address the issue completely.
Both data centers operate monitored closed circuit television, and are manned by both security and technical personnel 24/7/365. Access to racks is limited to the providers’ personnel via biometrics and access card controlled man-traps. The data centers are carrier neutral and connected via diverse routes to multiple tier 1 connectivity providers, and are monitored 24/7/365 by either a remote or on-site Network Operations Center (NOC). Similarly, both are connected to multiple power substations, and are backed by at least N+1 generators. As a result, both vendors offer 100% network connectivity and power SLAs. In addition, in the event of hardware failure, vendors are subject to a one hour hardware replacement SLA.
Sentries are full nodes (for a specific network) placed between the public Internet and the validation servers. Sentry nodes ensure that validation servers are not directly exposed to the public Internet, connect to other full nodes on the network, gossip transactions and blocks, keep validation servers in sync with the network, and act as the first line of defense against DDoS attacks. They can also be repurposed for the role of transaction filtering to guard against transaction spam attacks. We've deployed sentry nodes on AWS, as they are less security critical than the validation servers. For the Cosmos Hub, six Chorus sentries are placed in the United States, the European Union and Korea. Geographical redundancy helps the network gossip protocol and keeps our validators better in sync with the network, as we are able to leverage the cloud provider's backbone network to gossip messages between continents, giving us lower latency than if we were to peer the same servers over the public Internet.
A subset of Chorus sentries are private sentries. Private sentries do not actively accept inbound network connections and make outbound connections via a NAT gateway. They are connected to private sentries of other validators to create an inter-validator gossip network. Chorus has partnered with several validation firms to build a private sentry network. Private sentries act as a second line of defense against DDoS attacks. Even if all Chorus public facing sentries were successfully taken down, our validators would stay in sync with the Cosmos network via the private sentry layer.
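For Tendermint-based networks, this kind of topology is typically expressed through the node's p2p configuration. The `config.toml` fragments below are an illustrative sketch only (the node IDs and hostnames are hypothetical, and the two excerpts belong to two different nodes), not our production configuration:

```toml
# Excerpt 1 -- validator node: peer exchange disabled, talks only to our own sentries.
[p2p]
pex = false
persistent_peers = "a1b2c3d4@sentry-1.internal:26656,e5f6a7b8@sentry-2.internal:26656"

# Excerpt 2 -- public sentry: gossips with the wider network, never reveals the validator.
[p2p]
pex = true
persistent_peers = "0f9e8d7c@validator.internal:26656"
private_peer_ids = "0f9e8d7c"   # the validator's node ID is never gossiped to peers
```

A private sentry would additionally refuse inbound connections at the network layer (e.g. by sitting behind a NAT gateway), while keeping outbound persistent peers to partner sentries.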
In addition to sentries, Chorus runs load-balanced seed nodes on AWS. The primary purpose of these nodes is to crawl the Tendermint p2p network and keep an updated address book of all other full nodes. Newly joined, or badly connected, full nodes are able to connect to the Chorus seed nodes and update their address books. A secondary purpose is to allow developers to make blockchain data queries during their application development lifecycle. Chorus seed nodes are a public utility run to ensure the health of the various networks onboarded by the firm. For Cosmos, they are available at https://cosmos.chorus.one:26657 and for Terra at https://terra.chorus.one:26657. The seed nodes also run copies of the light client daemon (LCD) for both Cosmos and Terra, available to the public on https://cosmos-lcd.chorus.one:1317 and https://terra-lcd.chorus.one:1317 respectively.
Validation keys, which sign all blocks on Tendermint networks, are critical security parameters of our validation infrastructure. These keys are provisioned in YubiHSM2 Hardware Security Modules that are physically inserted into the validation servers. The generation of keys, configuration of key material into hardware security modules, and backup and restoration of keys are important processes for a validation company, since any compromise in this supply chain will compromise the validator. Chorus has invested significant resources in its key infrastructure to make it compliant with the standards promulgated by NIST.
The central feature of the Chorus Key Infrastructure is the division of roles pertaining to the key lifecycle between different team members. We've created four internal roles for the various stages of the key infrastructure lifecycle and allocated them to different team members:
Software deployments in the Chorus Infrastructure ensure that team members occupying certain roles have the correct permissions and the least privileges required for their role. Let's take an example:
Assume the scenario is the configuration of two HSMs with the validation key of a newly onboarded network. Post configuration, the HSMs will be plugged into servers to sign blocks. Logs from the HSMs will be studied by the Cryptographic Auditors. One day, the HSMs will be retired and sensitive key material must be deleted from them. For these HSMs, permissions must be set up such that:
In order to perform fine-grained provisioning over Chorus keys, we've developed HSMman, an in-house key management tool. This tool automates the provisioning and configuration management of all of our HSMs. We have the capability to quickly provision new HSMs with key material as needed. Chorus has also built its key architecture with ample spare capacity, and is able to onboard new networks while running sensitive cryptographic ceremonies sparingly. Another critical feature of HSMman is that it allows encrypted key shares to be distributed between ceremony personnel over the Internet. It enables key shares to be communicated securely out-of-band between two air-gapped computers, such that we are able to conduct the sharing of cryptographic material remotely whilst maintaining its absolute secrecy. One-time encryption keys are used for the encryption and distribution of key shares.
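HSMman itself is not public, but the underlying idea of a one-time encryption key consumed by exactly one share can be illustrated in miniature. The sketch below (our own illustrative code, not HSMman) encrypts a share with a fresh random pad and authenticates it, so the ciphertext can travel over the Internet while the one-time key travels out-of-band:

```python
import hashlib
import hmac
import os

def encrypt_share(share: bytes) -> tuple[bytes, bytes, bytes]:
    """Encrypt a key share under a fresh one-time key (XOR one-time pad).

    Returns (one_time_key, ciphertext, mac). The ciphertext may travel over
    the Internet; the one-time key is exchanged out-of-band and never reused.
    """
    key = os.urandom(len(share))                       # fresh pad per share
    ciphertext = bytes(a ^ b for a, b in zip(share, key))
    mac = hmac.new(key, ciphertext, hashlib.sha256).digest()
    return key, ciphertext, mac

def decrypt_share(key: bytes, ciphertext: bytes, mac: bytes) -> bytes:
    """Verify integrity, then recover the share by XORing the pad back out."""
    expected = hmac.new(key, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("share tampered with in transit")
    return bytes(a ^ b for a, b in zip(ciphertext, key))
```

A production tool would layer key wrapping, operator authentication and ceremony logging on top; the sketch only shows the one-time-key encryption step.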
Chorus runs an active/active validation architecture. This means that there are two servers, one in Europe and one in North America, attempting to sign blocks on the PoS networks concurrently. Whenever a new (valid) block is proposed by a different validator, both these servers compete to get "signing rights" for that particular block. One of the servers ends up winning the race, and it's that server that signs the block. The other server is prohibited from signing any block on the same HRS (height, round, step) combination.
We've invested in this particular architecture for the following reasons:
This active/active architecture required us to design a signing coordinator - a set of components that allocate "signing rights" to servers; and ensure that only a single server can have a signing right at a particular HRS (height, round, step).
The major piece of the signing coordinator is a strongly consistent, multi-region, multi-master database operated by AWS. We've used their DynamoDB product to instantiate this database. A server's signing right is represented as the entry of a database row with the primary key corresponding to the HRS of the particular signing step. Both servers attempt to insert entries into the database for a particular HRS. Once the first server inserts the entry, the second server will be forbidden from entering a different entry with the same HRS in the database. In effect, the database acts as a pigeonhole for every signing step. Two competing pigeons try to enter the hole, but the pigeonhole is configured to accept only one of the pigeons.
On the signing-server side, prior to signing any block, the servers must first insert the corresponding entry into the DynamoDB database. Only when they are successful in inserting the entry will they be able to sign and move forward. This logic on the signing-server side was implemented on top of the Tendermint KMS in Rust. Chorus plans to open source the components of its active/active setup sometime in the summer of 2019.
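In DynamoDB, this pigeonhole maps to a conditional `PutItem` with a condition of the form `attribute_not_exists(hrs)`: the first writer wins, and every later write for the same (height, round, step) is rejected. The semantics can be modelled in miniature (the class and names here are our own illustration, not the production Rust code):

```python
from dataclasses import dataclass, field

@dataclass
class SigningCoordinator:
    """Miniature model of the conditional-write 'pigeonhole'.

    In production this is a DynamoDB PutItem with a condition like
    attribute_not_exists(hrs): the first server to write a given
    (height, round, step) row acquires the signing right; everyone
    else is refused, so no two servers ever sign the same HRS.
    """
    _slots: dict = field(default_factory=dict)

    def acquire(self, height: int, round_: int, step: int, server: str) -> bool:
        hrs = (height, round_, step)
        if hrs in self._slots:                  # pigeonhole already occupied
            return self._slots[hrs] == server   # idempotent for the winner only
        self._slots[hrs] = server               # first writer wins the race
        return True

coord = SigningCoordinator()
assert coord.acquire(100, 0, 2, "eu-server")        # EU wins the race
assert not coord.acquire(100, 0, 2, "na-server")    # NA must not sign this HRS
assert coord.acquire(101, 0, 2, "na-server")        # the next height is open again
```

The real system relies on DynamoDB's strong consistency across regions to make `acquire` atomic between two servers on different continents, which an in-process dictionary obviously does not demonstrate.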
Central Services is a catchall component that comprises some of the important core utilities required by the validation infrastructure. If the entire validator estate is like a city, central services is the raw infrastructure - water, power, waste management - required by most of the other components.
In the Chorus validation architecture, Central Services contains the following services:
We also use Vault to generate short-term role-based credentials for cloud providers and databases, such that persistent user accounts or roles are not required for the majority of use cases. Low TTLs mean that any inadvertent exposure of credentials has a far more limited impact.
Finally, the tokens generated for the above short-term credential sets are single use. If a user were to compromise a validation server and query Vault for the authentication password for a key stored on an HSM using the most recently used token, their request would be rejected.
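The behaviour we rely on here, a token bounded by both a use limit and a TTL, can be modelled as follows. This is an illustrative sketch of the semantics, not Vault's implementation:

```python
import time

class SingleUseToken:
    """Models a secrets-store token with a use limit of one and a short TTL."""

    def __init__(self, secret: str, ttl_seconds: float = 60.0):
        self._secret = secret
        self._expires_at = time.monotonic() + ttl_seconds
        self._used = False

    def read_secret(self) -> str:
        if self._used:
            raise PermissionError("token already used")      # replay is rejected
        if time.monotonic() > self._expires_at:
            raise PermissionError("token expired")           # TTL bounds exposure
        self._used = True
        return self._secret

# The legitimate consumer reads the secret exactly once...
token = SingleUseToken("hsm-auth-password", ttl_seconds=60.0)
assert token.read_secret() == "hsm-auth-password"

# ...so an attacker who later steals the token gets nothing.
try:
    token.read_secret()
    assert False
except PermissionError:
    pass
```

Combining both bounds means a leaked credential is useless either immediately (already used) or within seconds (expired), whichever comes first.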
Observability of an infrastructure estate is one of the most basic but most powerful tools at the disposal of an engineer. It allows one to view, understand, proactively respond to, and be automatically alerted about changes in the behaviour or performance of servers.
It can be broken down into three main streams:
Figure 2. A screenshot of one of our monitoring dashboards
We use a combination of Prometheus, Menoetius, Grafana and InfluxDB in a high-availability layout for our internal monitoring, coupled with an external alerting service which manages our 24/7 on-call rotation and alert escalation processes. Our internal monitoring service is in turn monitored by an external monitoring service, so failure of the internal service will itself raise an alert; this ensures we do not blindly miss outage events due to monitoring service downtime. Also monitored are our external endpoints, from website and blog through to seed nodes and sentries. If any of these experience over one minute of downtime, we are immediately alerted.
In terms of logging, we run a setup comprising Elasticsearch, Logstash and Kibana, commonly referred to as the ELK stack. Logs are collected on each node and shipped via Fluentd over HTTPS to a high-availability queueing service, so that logs are not lost if Logstash processing nodes experience an outage.
The onboarding of any new network is not deemed complete until we have sufficient logging and monitoring in place to support the operation of that network.
In order to better inform our business decisions, and to bootstrap future development we have invested a significant amount of effort in writing an analytics engine called Blockstream. It connects to and subscribes to events from all the blockchains on which we validate. Each event has an observer, to which one or more processing modules can be assigned. The processed data can be indexed and stored in a way that allows us to run complex queries against the data that would be computationally expensive if they were run against the chain each and every time. It has been designed to be blockchain agnostic, and easily extensible to incorporate new chains with little effort.
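Blockstream is an internal tool, but its event/observer layout can be sketched as follows. All names here are illustrative, not the actual Blockstream API:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Blockchain-agnostic event dispatch: each event type has an observer,
    and any number of processing modules can be attached to it."""

    def __init__(self):
        self._modules: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def attach(self, event_type: str, module: Callable[[dict], None]) -> None:
        self._modules[event_type].append(module)

    def publish(self, event_type: str, event: dict) -> None:
        # Fan the chain event out to every processing module registered for it.
        for module in self._modules[event_type]:
            module(event)

# A processing module that indexes delegations, so that "total delegated to X"
# is a cheap lookup instead of a full chain scan on every query.
delegation_index: dict[str, int] = defaultdict(int)

def index_delegations(event: dict) -> None:
    delegation_index[event["validator"]] += event["amount"]

bus = EventBus()
bus.attach("delegate", index_delegations)
bus.publish("delegate", {"validator": "chorus", "amount": 500})
bus.publish("delegate", {"validator": "chorus", "amount": 250})
assert delegation_index["chorus"] == 750
```

Adding a new chain then amounts to writing an adapter that translates that chain's events into `publish` calls, while existing processing modules remain untouched.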
Our infrastructure is broken down into logical units that we refer to as ServiceInstances. Each ServiceInstance has a specific purpose (e.g. Cosmos network, logging, central services, etc.) and spans multiple cloud-provider regions to ensure no service can fail due to an outage of a single region or vendor.
Cloud regions within a ServiceInstance are peered with one another, ensuring we are able to use cloud providers' backbone networks for the lowest possible latency connections between sentries.
Access controls (discussed further in Section 6) are in place such that individuals are unable to traverse between ServiceInstances. Only traffic that is explicitly permitted can pass between nodes, so compromise of a node in the Loom ServiceInstance should not be able to impact the operation of our Cosmos validator.
Furthermore, each ServiceInstance comprises public and private subnets. Nodes that expose ports to the public Internet reside in the former, with non-public nodes sitting in the latter routing outbound connections through highly available NAT gateways.
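In Terraform terms, the public/private split of a ServiceInstance looks roughly like the following fragment. This is a hedged sketch (resource names are invented, and the VPC and route table resources it references are omitted), not our production code:

```hcl
# One ServiceInstance: a public subnet for sentries, a private subnet for
# non-public nodes, and a NAT gateway for private-subnet egress.
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.service_instance.id
  cidr_block              = "10.10.1.0/24"
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.service_instance.id
  cidr_block = "10.10.2.0/24"
}

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "egress" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Private nodes reach the Internet only outbound, via the NAT gateway.
resource "aws_route" "private_egress" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.egress.id
}
```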
Figure 3. High-level overview of the current Chorus sentry and validation infrastructure
This section aims to cover the specific layers in our multi-layer security model, and how these steps combine to create an industry leading validation infrastructure.
Figure 4. Access control diagram of the Chorus infrastructure
By using TLS encryption we can protect communication between entities from Man in the Middle attacks (MitM Attacks, Wikipedia). With the exception of the P2P ports, all publicly available endpoints (and a vast majority of internal services) are secured using TLS. With the advent of free SSL certificates through LetsEncrypt, TLS secured communications are within reach of everybody.
In order to mitigate the impact on hosts of denial of service attacks (Denial of Service Attacks, Wikipedia) we route inbound packets from the public Internet via cloud-provider DDoS protection.
In addition to our public sentries, we run 'private' sentries behind NAT (Network Address Translation, Wikipedia) gateways that only permit outbound connections so that in the event of a targeted attack against our public nodes, we will be able to continue receipt and transmission of P2P packets to and from the wider network.
These private sentries are also ‘privately peered’ with carefully selected partners via cloud-provider network peering, to ensure both parties benefit from each others’ connectivity even in the harshest of adversarial climates.
Operation of our nodes is ultimately conducted via SSH, although we minimise this as much as humanly possible through automation. Because poorly protected SSH endpoints are a primary target for any intruder, we do not expose SSH on any port (exposing it on a non-default port is considered security through obscurity, and a poor countermeasure), preferring to utilise an encrypted VPN for all control traffic in and out of our infrastructure.
As discussed above, all administrative and operations traffic in and out of the network traverses a VPN. Each ServiceInstance has its own VPN to allow control over access to ServiceInstances on a least-privilege basis. A user's access to a VPN requires a certificate (which can be revoked immediately in the event of loss, disclosure or termination of position), a username and password combination, and a multi-factor one-time passcode (One-time Passcode, Wikipedia) token.
Once connected over VPN, further SSH connectivity to any given node inside our infrastructure requires the user's public key (PKI) and MFA (Multi-factor authentication) token. User access to nodes is provided per-node on a least-privilege basis. As mentioned above, SSH traversal between nodes and/or between ServiceInstances is not permitted.
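On the host side, a policy of this kind typically maps to sshd settings like the following excerpt. This is an illustrative sketch (the VPN address and group name are hypothetical), not our exact configuration:

```
# /etc/ssh/sshd_config (excerpt)
PasswordAuthentication no                              # public keys only
PermitRootLogin no
AuthenticationMethods publickey,keyboard-interactive   # key + MFA token via PAM
ListenAddress 10.8.0.1                                 # VPN interface only, never a public IP
AllowGroups ops                                        # per-node, least-privilege groups
```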
Access between networks (ingress from public Internet, public -> private networks, communications between internal services, egress to public Internet) is controlled by per-service firewall rules. Only packets to and from whitelisted IPs and/or ports are permitted through the firewall.
In addition to network-based firewalls above, ingress and egress of data to and from any given host is controlled by host-based firewall rules. Packets failing these rules are logged such that unauthorized egress of data will be alerted upon.
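As an illustration of host-based egress control with logging (a sketch, not our production ruleset), an nftables configuration fragment might look like this:

```
table inet filter {
  chain output {
    type filter hook output priority 0; policy drop;
    ct state established,related accept
    tcp dport { 26656, 443 } accept               # whitelisted P2P and HTTPS egress
    log prefix "egress-denied: " counter drop     # logged packets feed the alert pipeline
  }
}
```

The final rule is what makes unauthorized egress visible: any packet that fails the whitelist is both dropped and written to the log stream that the monitoring stack alerts on.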
Our host-based intrusion detection system (HIDS) triggers events based upon rulesets when certain conditions are met (for example, attempted writes to configuration files, prohibited system calls, etc.) and raises an alert to the team, allowing us to respond swiftly.
All user access accounts permissions are based on a least-privilege model (Principle of Least Privilege, Wikipedia), where the access afforded to a user is based upon the minimum permissions required to fulfil their duties. This applies to every service where user accounts are created, and for all users regardless of their position in the company.
For non-user service access (application access to databases, container registries, remote storage, etc.) we utilise, where possible, lease-based credentials with limited scope and either single-use lifetime or, at worst case, a short TTL. In this case, if credentials are unintentionally disclosed, the impact is limited to a small subset of permissions and for a very limited time period. In the event of disclosure of an already used single-use credential, an attacker will be unable to authenticate with the service for a second time.
For services for which lease-based access is not-suited, we utilise Hashicorp Vault, an open-source FIPS-140-2 compliant highly-available secrets management service. Secrets are stored encrypted, and retrieved using a one-time token to limit access to authorized processes.
As discussed above, all regular signing operations (network consensus, price oracles, etc.) use the YubiHSM2 from Yubico. User authentication credentials are stored securely in Vault; when required to unlock the HSM for signing, they are retrieved as a single-use wrapped token representing a short-TTL lease, and the secret is never persisted to disk.
All irregular signing activities (withdrawing, delegating, sending of assets), where the keys are not required to be 'online', are conducted using company-owned Ledger hardware wallet devices.
We have invested a significant amount of energy into building a very secure and highly available validator, but no infrastructure is "finished". There will always be the next improvement to make; the next defense to introduce into one’s setup and the next networks to onboard. In this section, we cover some projects or improvements that can become important themes for Chorus in the future.
One new development in the infrastructure security industry is the inception of commercially usable Trusted Execution Environments (TEEs). The most well-known, and well developed, TEE is Intel SGX. TEEs such as Intel SGX allow infrastructure builders to isolate critical parts of the application code and run them in hardware enclaves. Code running in hardware enclaves has greater resistance against external attackers. Such code offers guarantees against unauthorized access and attempts to change the code even if the host machine running the enclave is compromised by an attacker.
One interesting possibility is to run the Tendermint KMS code, one of the critical parts of the Chorus active/active system, on SGX enclaves. Doing so would provide a line of defense against putative attackers who have compromised the validation servers. If the KMS runs in SGX enclaves, the company gains the ability to recover from compromises of validation servers, since it would take a prohibitive amount of time and resources for attackers to penetrate the SGX defenses.
Discussions around Intel SGX often devolve into an analysis of how strong the security guarantees of Intel SGX are. Intel hardware, including SGX, has been beset by speculative execution attacks that give vectors to penetrate the defenses of hardware enclaves. Our perspective is that Intel SGX is not the be-all and end-all of application security. It is merely one line of defense for infrastructure builders, and should be used in conjunction with other lines of defense. Intel SGX does have its utility as a line of defense, because it serves to slow attackers down for long enough to make a material difference to the overall security of validator infrastructure.
Validation is a young industry. There are only 10-20 networks that are genuinely interesting enough to validate today. Over time, we foresee an explosion in the number and variety of PoS networks. Perhaps, one day, the validation problem statement might be to onboard >50, or even >100 networks every year. By comparison, we're on track to onboard 5-10 networks this year.
As the sheer number of networks to be onboarded grows, our infrastructure will need to evolve to scale onboarding costs better. We could foresee the following become bottlenecks to our infrastructure:
The combination of both of the above is ultimately a purely cloud-based validator: a setup that runs in the cloud and is very easy to scale horizontally, bundled with automation that allows us to onboard new networks in an automated fashion. Chorus will likely move in that direction as the validation space evolves.
In building out our infrastructure, we have sought to walk the fine line between security and agility: implementing all of the most common security best practices while ensuring agility around our ability to scale out and onboard new networks and new software. We’ve built our infrastructure for our specific needs as a distributed organisation. We are always keen to help out fellow validators with their infrastructure - give us a shout if we could be of help!