
Monitor and observe

Support, stability, and dependency info

High-availability Namespaces are in Public Preview for Temporal Cloud.


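How do you monitor replication and audit failovers for a high-availability Namespace? This section provides how-to information for the following operations tasks: checking replication lag metrics, monitoring replication progress, and auditing failover events.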

Metrics

Replication lag is the delay in transmitting Workflow updates and history events from the active region to the standby region. A forced failover during a period of high replication lag is more likely to roll back Workflow progress, so always check the replication lag metrics before initiating a failover. Temporal Cloud emits three replication lag metrics: temporal_cloud_v0_replication_lag_bucket, temporal_cloud_v0_replication_lag_sum, and temporal_cloud_v0_replication_lag_count. The following sample queries show how you can use these metrics to explore replication lag.

P99 replication lag histogram

histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le))

Average replication lag

sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace)
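If you export these series and want to reproduce the two calculations outside of a dashboard, for example in a pre-failover check script, the arithmetic can be sketched in Python. This is a minimal sketch, not Temporal tooling: the function names, the sample bucket boundaries, and the `max_lag` threshold are all hypothetical, and the quantile estimate only approximates the linear interpolation that a Prometheus-style `histogram_quantile` performs. Lag values are in whatever unit the metric reports.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, including the final
    +Inf bucket, in the shape a *_bucket series exposes via its "le" label.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def average_lag(lag_sum, lag_count):
    """Average lag over a window: increase of *_sum divided by increase of *_count."""
    return lag_sum / lag_count if lag_count else math.nan


def lag_is_acceptable(p99_lag, max_lag):
    """Hypothetical gate: True only when a P99 value exists and is within max_lag."""
    return not math.isnan(p99_lag) and p99_lag <= max_lag


# Hypothetical sample data: cumulative counts per bucket upper bound.
sample_buckets = [(1.0, 50), (5.0, 90), (10.0, 98), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample_buckets)
print("estimated P99 lag:", p99)
print("average lag:", average_lag(120.0, 60))
print("within threshold:", lag_is_acceptable(p99, max_lag=5.0))
```

Treating a missing P99 value as "not acceptable" is a deliberate choice in this sketch: if no lag data is available, the script cannot confirm that a failover is low risk.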

Monitoring and observability

You can view and alert on key cloud metrics using the Web UI, the 'tcld' CLI utility, and Temporal Cloud APIs. For example, while you add a region to a Namespace, you can see the progress of Workflow replication. Any errors that occur also surface in the Namespace Web UI.

info

You may notice that a multi-region Namespace shows twice (2x) the Action count in temporal_cloud_v0_total_action_count. This doubling is a result of regional replication.

Auditing operational events

Temporal Cloud provides several ways to audit events:

  • When Temporal triggers a failover, the audit log records the details. Look specifically for "operation": "FailoverNamespace" in the log entries.
  • You can set alerts for Temporal-initiated failover events.
  • After a failover, you can check that the Namespace is active in the new region using the Temporal Cloud Web UI.
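
If you export audit logs and want to pick out failover events programmatically, a filter on the "operation" field is usually enough. The following is a minimal sketch that assumes the exported log is line-delimited JSON with one record per line. That layout, the function name, and every field in the sample other than "operation" are assumptions made for illustration, not a description of the actual audit log schema.

```python
import json


def failover_events(lines, operation="FailoverNamespace"):
    """Yield parsed audit records whose "operation" field matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get("operation") == operation:
            yield record


# Hypothetical sample records; only the "operation" key comes from the docs.
sample_log = [
    '{"operation": "UpdateNamespace", "example_field": "a"}',
    '{"operation": "FailoverNamespace", "example_field": "b"}',
    "not json",
    "",
    '{"operation": "FailoverNamespace", "example_field": "c"}',
]

for event in failover_events(sample_log):
    print(event)
```

In practice you would pass an open file handle or a stream of exported log lines instead of `sample_log`, and adjust the parsing to match the format your audit log sink actually delivers.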