
Enable high availability

Support, stability, and dependency info

High-availability Namespaces are in Public Preview for Temporal Cloud.


You can enable the high-availability Namespace feature for your existing Namespace by adding a second zone to your Namespace. After adding the second zone, Temporal Cloud begins data replication for your new standby replica. Temporal Cloud notifies you once the replication has caught up and both Namespace zones are in sync.

Advantages of using a high-availability Namespace:

  • No manual deployment or configuration needed, just simple push-button operation.
  • Open Workflows continue in the standby region with minimal interruption and minimal data loss.
  • No changes needed for Worker and Workflow code during setup or failover.
  • 99.99% Contractual SLA.

Create a multi-region Namespace

The following sections explain how to create a new multi-region Namespace (MRN). MRNs provide multi-region deployment backed by Temporal's data replication and active-standby features.

tip

Pairing is currently limited to regions within the same continent.

Temporal Cloud Web UI

During Namespace creation, specify the first region for the Namespace. Then, select the “Add a region” option. Adding a second region enables multi-region Namespace capabilities.

Temporal 'tcld' CLI

Start with the following command to create the new multi-region Namespace:

tcld namespace create \
--namespace <namespace_id>.<account_id> \
--region <primary_region> \
--region <standby_region>

Include both regions by specifying the region codes as arguments to the --region flags. Before pressing return, add your authentication credentials. For example, --ca-certificate-file <path-to-pem-file>.

Upgrade an existing single-zone Namespace for high-availability functionality

You can upgrade an existing single-zone Namespace for high availability by adding a standby zone. The following sections show you how.


Temporal Cloud Web UI

To upgrade an existing Namespace to a multi-region Namespace:

  1. Visit Temporal Cloud Namespaces in your web browser.
  2. Navigate to the Namespace details page.
  3. Select the “Add a region” button.
  4. Select the standby region you want to add to this Namespace.

You will see an estimated time for replication. This time is based on your selection and the size and scale of Workflows in your Namespace. An email alert is sent once your multi-region Namespace is ready for use.

Temporal 'tcld' CLI

At the command line, enter:

tcld namespace add-region \
--namespace <namespace_id>.<account_id> \
--region <region>

Specify the region code for the new region to add. Before pressing return, add your authentication credentials. For example, --ca-certificate-file <path-to-pem-file>. An email alert is sent once your multi-region Namespace is ready for use.

Discontinuing multi-region availability

Disabling multi-region removes the high availability and automatic failover features that provide Temporal's highest service level agreement. To disable the feature and end charges, users must contact Temporal Support directly. MRN-specific charges for replication will stop once this decommissioning procedure completes.

  • When making your request you must let us know which region you want the Namespace to land in after removing the standby region.
  • If you cease services in the middle of the month, your Namespace will be converted to a single-region Namespace within one business day.
  • Temporal won't retain replicated data in the standby region once multi-region has been disabled.
  • After disabling multi-region, Temporal Cloud cannot re-enable the feature for a given Namespace for seven days.

Triggering failovers

Failovers happen automatically in Temporal when a regional outage or disaster affects a multi-region Namespace. You can also trigger a failover based on custom alerts or for testing purposes. This section explains how to manually trigger a failover and what to expect afterward.

Regular failover testing ensures your app can handle disruptions and continue running smoothly in production. Whether responding to incident warnings or conducting tests, follow the steps in the next sections to move your active Namespace to its standby region and learn how to handle failovers effectively.

For details on how Temporal detects conditions and triggers failovers automatically, see Failovers.

Check Your Replication Lag

Always check the replication lag metric before initiating a failover. A forced failover during a period of large replication lag is more likely to roll back Workflow progress.
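As a rough sketch of this pre-flight check, the shell function below refuses to proceed when the observed lag exceeds a limit. The function name `lag_ok`, the 30-second default limit, and the way you obtain the lag value (for example, from your own metrics tooling) are all assumptions for illustration. None of them are part of tcld or Temporal Cloud.

```shell
# Hypothetical pre-failover guard. The lag value must come from your own
# metrics tooling; nothing here queries Temporal Cloud.
# Usage: lag_ok <observed_lag_seconds> [max_lag_seconds]
lag_ok() {
  observed="$1"
  limit="${2:-30}"   # assumed default limit; tune it for your workload
  # Reject non-integer input so a failed metrics query never passes the check.
  case "$observed" in
    ''|*[!0-9]*) echo "invalid lag value: '$observed'" >&2; return 2 ;;
  esac
  if [ "$observed" -le "$limit" ]; then
    echo "lag ${observed}s <= ${limit}s: OK to fail over"
  else
    echo "lag ${observed}s > ${limit}s: wait before failing over" >&2
    return 1
  fi
}

# Example with a fixed value instead of a live metric.
lag_ok 4 30
```

In a runbook, you might chain this in front of the failover command so that a high or unreadable lag value stops the procedure.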

Performing manual failovers

You can trigger a failover manually using the Temporal Cloud Web UI or the tcld CLI, depending on your preference and setup. The following sections outline the steps for each method.

Temporal Cloud Web UI

  1. Visit the Namespace page on the Temporal Cloud Web UI.
  2. Navigate to your Namespace details page and select the Trigger a failover option from the menu.
  3. After confirming, the failover will be initiated.

Temporal 'tcld' CLI

To manually trigger a failover, run the following command in your terminal:

tcld namespace failover \
    --namespace <namespace_id>.<account_id> \
    --region <target_region>

Post-failover event information

After any failover, whether triggered by you or by Temporal, event information appears in both the Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs. The audit log entry for Failover uses the "operation": "FailoverNamespace" event. After failover, the Namespace is active in the new region.
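If you export audit logs as text, a filter along these lines can pull out failover entries. This is only a sketch: the one-JSON-object-per-line layout and every field other than `"operation": "FailoverNamespace"` are assumptions made up for the example, and `failover_events` is a hypothetical helper name.

```shell
# Hypothetical audit log export: one JSON object per line. Only the
# "operation": "FailoverNamespace" key/value comes from the text above;
# every other field is invented for illustration.
sample_log='{"operation": "UpdateNamespace", "namespace": "ns1.acct"}
{"operation": "FailoverNamespace", "namespace": "ns1.acct"}
{"operation": "FailoverNamespace", "namespace": "ns2.acct"}'

# Print only failover entries; tolerate optional whitespace after the colon.
failover_events() {
  grep -E '"operation":[[:space:]]*"FailoverNamespace"'
}

printf '%s\n' "$sample_log" | failover_events
```

With real logs, you would pipe your exported file into `failover_events` in place of the sample variable.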

You don't need to monitor Temporal Cloud's failover response in real-time. Whenever there is a failover event, users with the Account Owner and Global Admin roles automatically receive an alert email.

Failbacks

After Temporal-initiated failovers, Temporal Cloud shifts Workflow Execution processing back to the original region that was active before the incident (a "failback") once the incident is resolved.

Reasons to test failing over

Microservices and external dependencies will fail at some point. Testing failovers ensures your app can handle these failures effectively. Temporal recommends regular and periodic failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your app continues to function even when parts of the infrastructure fail.

Safety First

If this is your first time performing a failover test, run it with a test-specific Namespace and application. This helps you gain operational experience before applying it to your production environment. Practice runs help ensure the process runs smoothly during real incidents in production.

Trigger testing can:

  • Validate multi-region deployments: In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages. This maintains high availability in mission-critical deployments. Manual testing confirms the failover mechanism works as expected, so your system handles regional outages or disasters effectively.

  • Assess replication lag: Monitoring replication lag between regions is crucial in multi-region setups. Check the lag before initiating a failover to avoid rolling back Workflow progress. Manual testing helps you practice this critical step and understand its impact. When there's no real incident, the switch over (recovery) should happen almost instantly.

  • Assess recovery time: Manual testing helps you measure actual recovery time. You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the Multi-region Namespace SLA.

  • Identify potential issues: Failover testing uncovers problems not visible during normal operation. This includes issues like backlogs and capacity planning and how external dependencies behave during a failover event.

  • Validate fault-oblivious programming: Temporal uses a "fault-oblivious programming" model, where your app doesn’t need to explicitly handle many types of failures. Testing failovers ensures that this model works as expected in your app.

  • Operational readiness: Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents when they arise.

Testing failovers regularly ensures your Temporal-based applications remain resilient and reliable, even when infrastructure fails.
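To make the recovery-time assessment above concrete, a test run could record timestamps around the failover and compare the elapsed time with the 20-minute RTO. The sketch below assumes you capture the start and end times yourself (for example, with `date +%s`); the helper name `within_rto` is hypothetical.

```shell
# RTO target from the text above: 20 minutes, expressed in seconds.
RTO_SECONDS=$((20 * 60))

# Usage: within_rto <start_epoch_seconds> <end_epoch_seconds>
# Prints the elapsed time and succeeds only if it is within the RTO.
within_rto() {
  elapsed=$(( $2 - $1 ))
  echo "recovery took ${elapsed}s (RTO ${RTO_SECONDS}s)"
  [ "$elapsed" -le "$RTO_SECONDS" ]
}

# Example with fixed timestamps that are 95 seconds apart.
within_rto 1700000000 1700000095
```

How you decide that recovery is complete (for instance, the first Workflow Task processed after the switch) is up to your test plan; the function only does the arithmetic.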

Worker Deployment

Enabling the multi-region Namespace does not require specific Worker configuration. The process is invisible to the Workers. When a Namespace fails over to the standby region, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. More details are available in the Routing section below.

info
  • When a Namespace fails over to a standby region, Workers will be communicating cross-region.

  • In case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to your standby region.

Routing

When using multi-region for a Namespace, the Namespace's DNS record <ns>.<acct>.<tmprl_domain> targets a regional DNS record in the format <region>.region.<tmprl_domain>. In this format, <region> is the currently active region for your Namespace. Clients resolving the Namespace’s DNS record are directed to connect to the active region for that Namespace, thanks to the regional DNS record.

During failover, Temporal Cloud changes the target of the Namespace DNS record from one region to another. Namespace DNS records are configured with a 15-second TTL, so any DNS cache should re-resolve the record within that window. As a rule of thumb, DNS reconciliation takes no longer than twice (2x) the TTL. Clients should therefore converge on the newly targeted region within 30 seconds at most.
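One way to observe this convergence during a test is to poll the Namespace record until it points at the expected regional record, giving up after twice the TTL. The following is a minimal sketch, assuming `dig` is installed; `resolve_cname` and `wait_for_region` are hypothetical helper names, and the 5-second polling interval is an arbitrary choice.

```shell
TTL_SECONDS=15                   # TTL stated above
MAX_WAIT=$((2 * TTL_SECONDS))    # rule-of-thumb bound: 30 seconds

# Print the CNAME target of a DNS name (assumes `dig` is installed).
resolve_cname() {
  dig +short "$1" CNAME
}

# Poll until <name> points at <expected_target> or the timeout expires.
# Usage: wait_for_region <name> <expected_target> [timeout_s] [interval_s]
wait_for_region() {
  deadline=$(( $(date +%s) + ${3:-$MAX_WAIT} ))
  while :; do
    target=$(resolve_cname "$1") || target=""
    # dig prints a trailing dot; strip it before comparing.
    if [ "${target%.}" = "$2" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "${4:-5}"
  done
}

# Example (placeholders, not run here):
#   wait_for_region <ns>.<acct>.tmprl.cloud aws-us-west-2.region.tmprl.cloud
```

A non-zero exit after the timeout suggests that a resolver between you and Temporal Cloud is caching the record for longer than its TTL.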

info

Some networking configuration is required for failover to be transparent to clients and Workers when using PrivateLink. This section describes how to configure routing for multi-region Namespaces for PrivateLink customers only.

PrivateLink customers may need to change certain configurations for multi-region Namespace use. Routing configuration depends on networking setup and use of PrivateLink. You may need to:

  • override a DNS zone; and
  • ensure the network connectivity between the two regions.

Customer side solution example

When using PrivateLink, you connect to Temporal Cloud using IP addresses local to your network. The region.<tmprl_domain> zone is configured in the Temporal systems as an independent zone. This allows you to override it to make sure traffic is routed internally for the regions in use. You can check the Namespace's active region using the Namespace record CNAME, which is public.
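As an illustration of that check, the sketch below extracts the region label from a CNAME target in the <region>.region.<tmprl_domain> format. The helper name `region_from_cname` is hypothetical, and the live lookup shown in the comment assumes `dig` is installed.

```shell
# Strip the trailing dot and the ".region.<domain>" suffix from a CNAME target.
# Usage: region_from_cname <cname_target>
region_from_cname() {
  target="${1%.}"
  echo "${target%%.region.*}"
}

# Offline example using the record format described above.
region_from_cname "aws-us-east-1.region.tmprl.cloud."

# Live lookup (placeholders, not run here):
#   region_from_cname "$(dig +short <ns>.<acct>.tmprl.cloud CNAME)"
```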

To set up the DNS override, override specific regions to target the relevant IP addresses (for example, aws-us-west-1.region.tmprl.cloud to target 192.168.1.2). On AWS, you can do this with a private hosted zone in Route 53 for region.<tmprl_domain>. Link that private zone to the VPCs you use for Workers. PrivateLink is not yet offered for GCP multi-region Namespaces.

When your Workers connect to the Namespace, they first resolve the <ns>.<acct>.<tmprl_domain> record. This targets <active>.region.<tmprl_domain> using a CNAME. Your private zone overrides that second DNS resolution, leading traffic to reach the internal IP you're using.
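For the Route 53 case, one possible approach is to generate a change batch that upserts an A record in your private hosted zone. The sketch below only prints the JSON; the helper name `override_batch`, the choice of an A record, and the 15-second TTL are assumptions, and the record name and IP address reuse the example values above. Depending on your setup, you might instead point the record at your VPC endpoint in some other way.

```shell
# Print a Route 53 change batch that upserts one A record.
# Usage: override_batch <record_name> <private_ip> [ttl_seconds]
override_batch() {
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "A",
        "TTL": ${3:-15},
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}

# Example using the values from the text above.
override_batch aws-us-west-1.region.tmprl.cloud 192.168.1.2

# To apply it (the hosted zone ID is a placeholder, not run here):
#   aws route53 change-resource-record-sets --hosted-zone-id <zone-id> \
#     --change-batch "$(override_batch aws-us-west-1.region.tmprl.cloud 192.168.1.2)"
```

You would repeat this for each region your Namespace uses, so that both the active and standby regional records resolve to addresses inside your network.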

Consider how you'll configure Workers to run in this scenario. You might set Workers to run in both regions at all times. Alternately, you could establish connectivity between the regions to redirect Workers once failover occurs.

The following table lists Temporal's available regions, PrivateLink endpoints, and DNS record overrides. The sa-east-1 region listed here is not yet available for use with multi-region Namespaces.

Region | PrivateLink Service Name | DNS Record Override
ap-northeast-1 | com.amazonaws.vpce.ap-northeast-1.vpce-svc-08f34c33f9fb8a48a | aws-ap-northeast-1.region.tmprl.cloud
ap-northeast-2 | com.amazonaws.vpce.ap-northeast-2.vpce-svc-08c4d5445a5aad308 | aws-ap-northeast-2.region.tmprl.cloud
ap-south-1 | com.amazonaws.vpce.ap-south-1.vpce-svc-0ad4f8ed56db15662 | aws-ap-south-1.region.tmprl.cloud
ap-south-2 | com.amazonaws.vpce.ap-south-2.vpce-svc-08bcf602b646c69c1 | aws-ap-south-2.region.tmprl.cloud
ap-southeast-1 | com.amazonaws.vpce.ap-southeast-1.vpce-svc-05c24096fa89b0ccd | aws-ap-southeast-1.region.tmprl.cloud
ap-southeast-2 | com.amazonaws.vpce.ap-southeast-2.vpce-svc-0634f9628e3c15b08 | aws-ap-southeast-2.region.tmprl.cloud
ca-central-1 | com.amazonaws.vpce.ca-central-1.vpce-svc-080a781925d0b1d9d | aws-ca-central-1.region.tmprl.cloud
eu-central-1 | com.amazonaws.vpce.eu-central-1.vpce-svc-073a419b36663a0f3 | aws-eu-central-1.region.tmprl.cloud
eu-west-1 | com.amazonaws.vpce.eu-west-1.vpce-svc-04388e89f3479b739 | aws-eu-west-1.region.tmprl.cloud
eu-west-2 | com.amazonaws.vpce.eu-west-2.vpce-svc-0ac7f9f07e7fb5695 | aws-eu-west-2.region.tmprl.cloud
sa-east-1 | com.amazonaws.vpce.sa-east-1.vpce-svc-0ca67a102f3ce525a | aws-sa-east-1.region.tmprl.cloud
us-east-1 | com.amazonaws.vpce.us-east-1.vpce-svc-0822256b6575ea37f | aws-us-east-1.region.tmprl.cloud
us-east-2 | com.amazonaws.vpce.us-east-2.vpce-svc-01b8dccfc6660d9d4 | aws-us-east-2.region.tmprl.cloud
us-west-2 | com.amazonaws.vpce.us-west-2.vpce-svc-0f44b3d7302816b94 | aws-us-west-2.region.tmprl.cloud

Learn more about multi-region Namespaces

If you have more questions or feedback about this feature, reach out to the product team.