This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Always On Availability Groups

Check High Availability of databases

1: Always On Availability Group Detail

The Always On Availability Groups dashboard provides comprehensive visibility into the health and status of SQL Server Always On Availability Groups across your monitored instances. This dashboard helps you quickly verify AG configuration, monitor replica synchronization status, identify failover readiness issues, and ensure your high availability infrastructure is operating correctly.

Always On Availability Groups dashboard showing AG health, replica status, and synchronization state

Dashboard Overview

The dashboard displays a summary table of all configured Availability Groups across your SQL Server estate, making it easy to assess the health of your high availability infrastructure at a glance. Use this dashboard to perform regular health checks, verify failover readiness, and quickly identify AGs requiring investigation or intervention.

Availability Groups Table

The Availability Groups table provides detailed information about each configured AG:

Availability Group displays the AG name as a clickable link. Click the AG name to open the detailed AG dashboard showing per-replica metrics, database synchronization progress, redo queue depth, and comprehensive failover readiness information.

Primary Replica shows the current primary replica hostname. The primary replica handles all read-write operations and is the source for transaction log records sent to secondary replicas. In a properly functioning AG, this should match your expected primary server. If the primary is unexpected, a failover may have occurred that requires investigation.

Secondary Replicas displays a comma-separated list of all configured secondary replica hostnames. Secondary replicas receive transaction log records from the primary and can serve read-only workloads depending on configuration. This column helps you quickly verify all expected replicas are configured.

Total Nodes shows the total number of replicas configured in the AG, including the primary. Most AGs have 2-3 replicas, though SQL Server supports more for specific scenarios. This count should match your expected AG topology.

Online Nodes displays how many replicas are currently online and reachable. This should equal Total Nodes in a healthy AG. Values less than Total Nodes indicate one or more replicas are offline, disconnected, or experiencing connectivity issues, a critical situation requiring immediate investigation.

N. Databases shows the number of databases protected by this AG. This helps you understand the scope and importance of each AG. AGs protecting many databases or critical systems deserve closer monitoring.

Synchronization Health displays the overall synchronization state of the AG, typically showing “HEALTHY” (in green) when all replicas are synchronized and failover-ready, or “NOT HEALTHY” when synchronization issues exist. Unhealthy synchronization states indicate data protection risks and potential failover problems.

Listener DNS Name shows the AG listener’s DNS name if configured. Applications should connect to this listener name rather than directly to instance names, allowing transparent failover without connection string changes.

Listener IP displays the IP address or addresses associated with the AG listener. In multi-subnet configurations, multiple IPs may appear. Verify these IPs match your expected listener configuration.

Important

When Online Nodes is less than Total Nodes, one or more replicas are offline or unreachable. This reduces redundancy and may prevent automatic failover. Investigate immediately to restore full AG protection.

Using the Dashboard

Regular Health Checks: Review this dashboard daily to verify all AGs show “HEALTHY” synchronization status and all configured replicas are online. Early detection of synchronization issues prevents data loss during failovers.

Drill Down for Details: Click any AG name to access the detailed AG dashboard showing replica-level metrics including synchronization state, redo queue depth, log send rate, and database-specific synchronization status. Use these details to diagnose synchronization delays or performance issues.

Verify After Failovers: After planned or unplanned failovers, use this dashboard to confirm the expected server is now primary and all replicas have resynchronized. Verify listener DNS and IP addresses resolve correctly to the new primary.

Monitor Synchronization Health: “NOT HEALTHY” status requires immediate investigation. Common causes include network issues, replica performance problems, long-running transactions on the primary, or redo thread bottlenecks on secondaries. The detailed AG dashboard provides metrics to pinpoint the root cause.

Track Replica Topology: Use the Total Nodes and Secondary Replicas columns to maintain awareness of your AG configuration. Changes to expected topology may indicate configuration drift or unauthorized modifications requiring investigation.

Tip

Listener Connectivity: Always configure and use AG listeners for application connections. Listeners enable automatic connection redirection during failovers, eliminating manual connection string updates and reducing application downtime.

Investigating Issues

Offline Replicas: When Online Nodes is less than Total Nodes, check whether the offline replica is stopped, whether Windows Server Failover Clustering (WSFC) quorum is healthy, whether network connectivity exists between replicas, or whether the SQL Server service is running on the offline node.

Unhealthy Synchronization: Synchronization health issues may result from network bandwidth limitations preventing log records from reaching secondaries quickly enough, secondary replica performance problems causing redo queue buildup, transaction log I/O bottlenecks on primary or secondary replicas, or very large transactions overwhelming synchronization capacity.

Unexpected Primary Replica: If the primary replica is not the expected server, determine whether a planned failover occurred, whether an automatic failover responded to a failure, whether a manual failover was performed without proper communication, or whether cluster node preferences have changed.

Missing or Incorrect Listener Information: Verify the listener is properly configured in Windows Server Failover Clustering, confirm DNS records exist and resolve correctly, check that listener IP addresses are reachable from application servers, and ensure no firewall rules block listener ports.

Use the Instance Overview dashboard to check resource utilization, performance metrics, and wait statistics on primary and secondary replicas. High CPU, memory pressure, or I/O bottlenecks can impact AG synchronization performance.

Review the Blocking and Deadlocks dashboards if synchronization issues correlate with locking problems. Long-running transactions holding locks can delay log truncation and impact AG performance.

Check the SQL Server I/O Analysis dashboard to evaluate transaction log write performance on both primary and secondary replicas. Slow log I/O directly impacts synchronization speed and data protection.

Synchronous vs Asynchronous Commit

Synchronous commit mode provides zero data loss but requires waiting for secondary replica acknowledgment, potentially impacting transaction performance. Asynchronous commit mode offers better performance but allows potential data loss during failover.

Only synchronous commit mode is suitable for high availability, while asynchronous commit may be appropriate for disaster recovery scenarios or for offloading read-only workloads to secondary replicas.

1 - Always On Availability Group Detail

Check the state of a High Availability Group

The Always On Availability Group Detail dashboard provides comprehensive health and replication metrics for a single Availability Group, allowing you to monitor replica status, track failover history, analyze data movement performance, and identify synchronization issues. This dashboard is essential for troubleshooting AG problems, verifying failover readiness, and ensuring your high availability infrastructure operates optimally.

Availability Group Detail dashboard showing replica health, failover history, and replication metrics

Dashboard Sections

AG Summary

The summary section at the top displays key information about the Availability Group configuration and current state:

Availability Group displays the AG name for reference.

Primary Replica shows the current primary replica hostname. This is the replica handling all read-write operations and serving as the transaction log source for secondary replicas.

Secondary Replicas lists all configured secondary replica hostnames, showing your AG topology at a glance.

Total Nodes displays the total number of replicas configured in the AG, including primary and all secondaries.

Online Nodes shows how many replicas are currently online and reachable. This should equal Total Nodes in a healthy AG. Lower values indicate offline replicas that reduce redundancy and may prevent automatic failover.

N. Databases shows the number of databases protected by this AG.

Synchronization Health displays the overall synchronization state, typically “HEALTHY” when all databases on all replicas are synchronized and ready for failover, or showing issues when synchronization problems exist.

Listener Name displays the AG listener’s DNS name if configured. Applications should connect via this listener for automatic failover support.

Listener IP shows the IP address or addresses associated with the listener.

Primary Replica Failovers

The Primary Replica Failovers timeline visualizes which replica was primary at each point during the selected time range. This horizontal timeline shows the primary role assignment over time, with color-coded bars indicating which server held the primary role during each period.

Use this timeline to review recent failover history and understand failover frequency and patterns. Frequent failovers may indicate instability, while unexpected failovers during business hours require investigation. Correlate failover times with events from the Instance Overview or Events dashboards to identify what triggered role changes—planned maintenance, automatic failover due to health detection, or manual intervention.

Availability Group Nodes

The Availability Group Nodes table provides detailed configuration and status information for each replica:

Replica Instance shows the SQL Server instance name for each replica in the AG.

Replica role indicates whether the replica is currently PRIMARY or SECONDARY. Only one replica is primary at any time.

Operational State shows the current operational status of each replica (ONLINE, OFFLINE, etc.).

Sync. Health displays the per-replica synchronization status. “HEALTHY” indicates the replica is properly synchronized with the primary. Unhealthy states indicate synchronization problems requiring investigation.

Availability Mode shows whether the replica uses SYNCHRONOUS_COMMIT (waits for secondary acknowledgment before committing transactions, ensuring zero data loss) or ASYNCHRONOUS_COMMIT (commits without waiting, better performance but potential data loss during failover).

Failover Mode indicates whether the replica supports AUTOMATIC failover (can automatically become primary if the current primary fails) or MANUAL failover (requires manual intervention to become primary).

Seeding Mode shows whether the replica uses AUTOMATIC seeding (SQL Server automatically copies database files to initialize the replica) or MANUAL seeding (database files must be manually restored).

Secondary Allow Connections displays the read-intent settings for secondary replicas: NO (no connections allowed), READ_ONLY (only read-only connections allowed), or ALL (all connections allowed).

Backup Priority shows the priority value used for backup preference routing. Higher values indicate preferred backup targets when using AG-aware backup strategies.

Endpoint URL displays the database mirroring endpoint URL used for data movement between replicas.

R/O Routing URL shows the read-only routing address if configured, used to direct read-only queries to secondary replicas.

R/W Routing URL displays the read-write routing address if configured.

Important

Only replicas configured with SYNCHRONOUS_COMMIT and AUTOMATIC failover mode can participate in automatic failover. Asynchronous replicas or those with manual failover mode require manual intervention during primary failures, potentially increasing downtime.

Node Availability Metrics

The Node Availability section provides KPIs and visualization of replica online status:

Total Nodes KPI shows the configured replica count for quick reference.

Offline Nodes KPI displays how many replicas are currently offline. This should be 0 in a healthy AG.

Online Nodes chart plots the number of online replicas over time during the selected interval. A consistently flat line at the total node count indicates stable availability. Dips indicate periods when replicas went offline, while fluctuating lines suggest flapping replicas with intermittent connectivity issues.

Transfer Rates and Queue Sizes

These charts visualize data movement performance between primary and secondary replicas:

Transfer Rates chart displays:

Send Rate: How fast the primary replica sends transaction log records to secondaries, measured in MB/s or KB/s. Higher values indicate more transaction activity requiring replication.
Redo Rate: How fast secondary replicas apply received transaction log records to their databases. Redo rate should keep pace with send rate to maintain synchronization.

Low or decreasing redo rates on secondaries indicate performance bottlenecks. Common causes include slow storage I/O on secondaries, CPU pressure preventing redo threads from keeping up, or blocking on secondaries due to read workloads holding locks that conflict with redo operations.

Transfer Queue Size chart shows:

Send Queue Size: Amount of transaction log data (in KB or MB) waiting on the primary to be sent to secondaries. Growing send queues indicate network bandwidth limitations or secondary connectivity issues.
Redo Queue Size: Amount of transaction log data received by secondaries but not yet applied. Growing redo queues indicate secondaries cannot keep pace with transaction volume, creating synchronization lag that increases RPO (potential data loss during failover) and may delay failover readiness.

Warning

Large Redo Queues Impact Failover: High redo queue sizes mean secondaries are behind the primary and need time to catch up before becoming failover-ready. During automatic failover, SQL Server waits for redo queue processing, increasing failover duration and application downtime. Monitor and address redo queue buildup proactively.

Health History

The health history section provides time-series visualization of AG health trends:

Online Nodes History chart plots total nodes versus online nodes over time, showing historical availability patterns. Consistent alignment between total and online nodes indicates stable replica availability. Gaps indicate periods with offline replicas.

Database Health History chart shows total databases in the AG versus healthy databases over time. When these lines separate, one or more databases have become unsynchronized or unhealthy, requiring investigation. This may result from synchronization issues, suspended data movement, or database-specific problems.

Databases Replication Status

The Databases Replication Status table provides per-database synchronization details across all replicas:

SQL Instance shows which replica hosts each database copy.

Database Name identifies the database.

Sync. Health displays the synchronization status for this database on this replica. “HEALTHY” indicates proper synchronization with the primary, while other states indicate issues.

Is Primary Replica shows whether this row represents the primary copy (YES) or a secondary copy (NO).

Availability Mode displays the availability mode for this database replica, typically inheriting from the AG-level configuration but potentially overridden per database in some configurations.

Use this table to identify which specific databases have synchronization problems when overall AG health shows issues. Sort by Sync. Health to find unhealthy databases requiring attention. Filter by specific databases to see their status across all replicas.

Investigating Synchronization Issues

Growing Redo Queues: When redo queue sizes increase steadily, investigate secondary replica performance. Check the Instance Overview dashboard for the secondary—look for high CPU utilization, memory pressure, or I/O bottlenecks. Review the SQL Server error log on secondaries for redo thread errors or warnings. Consider whether read workloads on secondaries are causing blocking that interferes with redo operations.

Low Redo Rates: Consistently low redo rates relative to send rates indicate secondaries cannot keep pace with transaction volume. This may result from undersized hardware on secondaries compared to the primary, slow transaction log storage on secondaries, or configuration issues like insufficient max worker threads.

Increasing Send Queues: Growing send queues usually indicate network bandwidth limitations between replicas or secondary replicas that are offline or unreachable. Verify network connectivity, check for network saturation during peak periods, and ensure Windows Server Failover Clustering quorum is healthy.

Unhealthy Database Synchronization: Individual database synchronization issues may result from suspended data movement (check sys.dm_hadr_database_replica_states), insufficient secondary storage space preventing log application, or database-specific errors on the secondary (check SQL Server error log).

Frequent Failovers: Review the Primary Replica Failovers timeline to identify failover frequency and timing. Correlate failover times with system events, resource pressure, or operational activities. Unexpected automatic failovers may indicate intermittent primary replica health issues, aggressive health detection timeout settings, or network instability causing false failure detection.

Use the Instance Overview dashboard to check resource utilization and performance metrics on both primary and secondary replicas. High CPU, memory pressure, or I/O bottlenecks directly impact AG synchronization performance.

Review the SQL Server I/O Analysis dashboard to evaluate transaction log write performance on all replicas. Slow log I/O increases redo queue buildup and synchronization lag.

Check the Blocking dashboard if synchronization issues correlate with read workloads on secondary replicas. Blocking can interfere with redo thread operations and slow synchronization.

Monitor the Capacity Planning dashboard to ensure secondary replicas have adequate resources to handle both read workloads and redo operations without contention.

Best Practice

Regular Failover Testing: Periodically perform planned manual failovers during maintenance windows to verify automatic failover readiness, validate application connection failover behavior, and ensure all replicas can successfully assume the primary role. This validates your high availability configuration before an actual failure occurs.

Always On Availability Groups

Dashboard Overview

Availability Groups Table

Important

Using the Dashboard

Tip

Investigating Issues

Related Dashboards

Synchronous vs Asynchronous Commit

1 - Always On Availability Group Detail

Dashboard Sections

AG Summary

Primary Replica Failovers

Availability Group Nodes

Important

Node Availability Metrics

Transfer Rates and Queue Sizes

Warning

Health History

Databases Replication Status

Investigating Synchronization Issues

Related Dashboards

Best Practice