Reliability recommendations

Azure Advisor helps you ensure and improve the continuity of your business-critical applications. You can get reliability recommendations on the Reliability tab on the Advisor dashboard.

  1. Sign in to the Azure portal.

  2. Search for and select Advisor from any page.

  3. On the Advisor dashboard, select the Reliability tab.

AgFood Platform

Upgrade to the latest ADMA DotNet SDK version

We identified calls to an ADMA DotNet SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.

Potential benefits: Ensure uninterrupted access to ADMA

For more information, see What is Azure Data Manager for Agriculture?

Upgrade to the latest ADMA Java SDK version

We identified calls to an ADMA Java SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.

Potential benefits: Ensure uninterrupted access to ADMA

For more information, see What is Azure Data Manager for Agriculture?

Upgrade to the latest ADMA Python SDK version

We identified calls to an ADMA Python SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.

Potential benefits: Ensure uninterrupted access to ADMA

For more information, see What is Azure Data Manager for Agriculture?

Upgrade to the latest ADMA JavaScript SDK version

We identified calls to an ADMA JavaScript SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.

Potential benefits: Ensure uninterrupted access to ADMA

For more information, see What is Azure Data Manager for Agriculture?

API Management

Migrate API Management service to stv2 platform

Support for API Management instances hosted on the stv1 platform will be retired by 31 August 2024. Migrate to the stv2-based platform before that date to avoid service disruption.

Potential benefits: Improve service stability and leverage new platform features

For more information, see API Management stv1 platform retirement - Global Azure cloud (August 2024)

Hostname certificate rotation failed

The API Management service failing to refresh the hostname certificate from the Key Vault can lead to the service using a stale certificate and runtime API traffic being blocked. Ensure that the certificate exists in the Key Vault, and the API Management service identity is granted secret read access.

Potential benefits: Ensure service availability

For more information, see Configure a custom domain name for your Azure API Management instance

The legacy portal was deprecated 3 years ago and retired in October 2023. However, we are seeing active usage of the portal which may cause service disruption soon when we disable it.

We highly recommend that you migrate to the new developer portal as soon as possible to continue enjoying our services and take advantage of the new features and improvements.

Potential benefits: Ensure business continuity

For more information, see Migrate to the new developer portal

Dependency network status check failed

An Azure API Management service dependency isn't available. Check your virtual network configuration.

Potential benefits: Improve service stability

For more information, see Deploy your Azure API Management instance to a virtual network - external mode

SSL/TLS renegotiation blocked

SSL/TLS renegotiation attempt blocked; secure communication might fail. To support client certificate authentication scenarios, enable 'Negotiate client certificate' on listed hostnames. For browser-based clients, this option might result in a certificate prompt being presented to the client.

Potential benefits: Ensure service availability

For more information, see How to secure APIs using client certificate authentication in API Management

Deploy an Azure API Management instance to multiple Azure regions for increased service availability

Azure API Management supports multi-region deployment, which enables API publishers to add regional API gateways to an existing API Management instance. Multi-region deployment helps reduce request latency perceived by geographically distributed API consumers and improves service availability.

Potential benefits: Increased resilience against regional failures

For more information, see Deploy an Azure API Management instance to multiple Azure regions

Enable and configure autoscale for an API Management instance on production workloads

An API Management instance in a production service tier can be scaled by adding and removing units. The autoscaling feature can dynamically adjust the units of an API Management instance to accommodate a change in load without manual intervention.

Potential benefits: Increase scalability and optimize cost.

For more information, see Automatically scale an Azure API Management instance
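
The unit adjustment described in this recommendation can be pictured as a simple threshold rule. The following Python sketch is illustrative only: the function name, the 70% and 35% thresholds, and the unit bounds are assumptions chosen for the example, not values taken from API Management or its autoscale settings.

```python
def next_unit_count(current_units: int, capacity_percent: float,
                    min_units: int = 1, max_units: int = 4,
                    scale_out_at: float = 70.0, scale_in_at: float = 35.0) -> int:
    """Return the unit count a simple threshold-based autoscale rule would pick.

    All thresholds and bounds are hypothetical defaults for illustration.
    """
    if capacity_percent > scale_out_at:
        # Load is high: add one unit, but never exceed the configured maximum.
        return min(current_units + 1, max_units)
    if capacity_percent < scale_in_at:
        # Load is low: remove one unit, but never drop below the minimum.
        return max(current_units - 1, min_units)
    # Load is inside the target band: leave the unit count unchanged.
    return current_units


print(next_unit_count(2, 85.0))  # 3
print(next_unit_count(2, 20.0))  # 1
print(next_unit_count(2, 50.0))  # 2
```

Keeping a gap between the scale-out and scale-in thresholds is a common way to avoid adding and removing units in rapid succession.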

App Service

Scale out your App Service plan to avoid CPU exhaustion

High CPU utilization can lead to runtime issues with applications. Your application exceeded 90% CPU over the last couple of days. To reduce CPU usage and avoid runtime issues, scale out the application.

Potential benefits: Keep your app healthy

For more information, see Best practices for Azure App Service

Check your app's service health issues

We have a recommendation related to your app's service health. Open the Azure portal, go to the app, and select Diagnose and Solve to see more details.

Potential benefits: Keep your app healthy

For more information, see Best practices for Azure App Service

Fix the backup database settings of your App Service resource

When an application has an invalid database configuration, its backups fail. For details, see your application's backup history on your app management page.

Potential benefits: Ensure business continuity

For more information, see Best practices for Azure App Service

Fix the backup storage settings of your App Service resource

When an application has invalid storage settings, its backups fail. For details, see your application's backup history on your app management page.

Potential benefits: Ensure business continuity

For more information, see Best practices for Azure App Service

Scale up your App Service plan SKU to avoid memory problems

The App Service plan containing your application exceeded 85% memory allocation. High memory consumption can lead to runtime issues with your applications. Find the problem application and scale it up to a higher plan with more memory resources.

Potential benefits: Keep your app healthy

For more information, see Best practices for Azure App Service

Scale out your App Service plan

Consider scaling out your App Service Plan to at least two instances to avoid cold start delays and service interruptions during routine maintenance.

Potential benefits: Optimize user experience and availability

For more information, see https://aka.ms/appsvcnuminstances

Fix application code, a worker process crashed due to an unhandled exception

A worker process in your application crashed due to an unhandled exception. To identify the root cause, collect memory dumps and call stack information at the time of the crash.

Potential benefits: Keep your app healthy and highly available

For more information, see https://aka.ms/appsvcproactivecrashmonitoring

Upgrade your App Service to a Standard plan to avoid request rejects

When an application is part of a shared App Service plan and meets its quota multiple times, incoming requests might be rejected. Your web application can’t accept incoming requests after meeting a quota. To remove the quota, upgrade to a Standard plan.

Potential benefits: Keep your app healthy

For more information, see Azure App Service plan overview

Move your App Service resource to Standard or higher and use deployment slots

When an application is deployed multiple times in a week, problems might occur. You deployed your application multiple times last week. To help you reduce deployment impact to your production web application, move your App Service resource to the Standard (or higher) plan, and use deployment slots.

Potential benefits: Keep your app healthy while updating

For more information, see Set up staging environments in Azure App Service

Consider upgrading the hosting plan of the Static Web App(s) in this subscription to Standard SKU

The combined bandwidth used by all the Free SKU Static Web Apps in this subscription exceeds the monthly limit of 100 GB. Consider upgrading these applications to the Standard SKU to avoid throttling.

Potential benefits: Higher availability for the apps by avoiding throttling.

For more information, see Pricing - Static Web Apps
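
Because the limit applies to the combined usage of all Free SKU apps rather than to each app individually, the check amounts to a sum compared against 100 GB. The Python sketch below illustrates that arithmetic; the app names and usage figures are made up, and the helper names are hypothetical.

```python
FREE_SKU_MONTHLY_LIMIT_GB = 100  # monthly limit stated in the recommendation


def combined_bandwidth_gb(usage_by_app: dict) -> float:
    """Sum the monthly bandwidth (in GB) of every Free SKU app in a subscription."""
    return sum(usage_by_app.values())


def exceeds_free_limit(usage_by_app: dict) -> bool:
    """True when the combined usage is over the shared Free SKU limit."""
    return combined_bandwidth_gb(usage_by_app) > FREE_SKU_MONTHLY_LIMIT_GB


# Hypothetical usage: no single app is over 100 GB, but together they are.
usage = {"docs-site": 60.0, "marketing-site": 45.0}
print(combined_bandwidth_gb(usage))  # 105.0
print(exceeds_free_limit(usage))     # True
```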

Use deployment slots for your App Service resource

When an application is deployed multiple times in a week, problems might occur. You deployed your application multiple times over the last week. To help you manage changes and help reduce deployment impact to your production web application, use deployment slots.

Potential benefits: Keep your app healthy while updating

For more information, see Set up staging environments in Azure App Service

Consider changing your application architecture to 64-bit

Your App Service is configured as 32-bit, and its memory consumption is approaching the limit of 2 GB. If your application supports it, consider recompiling your application and changing the App Service configuration to 64-bit instead.

Potential benefits: Improve your application reliability

For more information, see Application performance FAQs for Web Apps in Azure

CX Observer Personalized Recommendation

CX Observer Personalized Recommendation

Potential benefits: NA

App Service Certificates

Domain verification required to issue your App Service Certificate

You have an App Service Certificate that's currently in a Pending Issuance status and requires domain verification. Failure to validate domain ownership will result in an unsuccessful certificate issuance. Domain verification isn't automated for App Service Certificates and will require action. If you've recently verified domain ownership and have been issued a certificate, you may disregard this message.

Potential benefits: Ensure successful issuance of App Service Certificate.

For more information, see Add and manage TLS/SSL certificates in Azure App Service

Application Gateway

Upgrade your SKU or add more instances

Deploying two or more medium or large sized instances ensures business continuity (fault tolerance) during outages caused by planned or unplanned maintenance.

Potential benefits: Ensure business continuity through application gateway resilience

For more information, see Multi-region load balancing - Azure Reference Architectures

Avoid hostname override to ensure site integrity

Avoid overriding the hostname when configuring Application Gateway. Having a domain on the frontend of Application Gateway that differs from the one used to access the backend can lead to broken cookies or redirect URLs. Make sure the backend can handle the domain difference, or update the Application Gateway configuration so the hostname doesn't need to be overwritten towards the backend. When used with App Service, attach a custom domain name to the Web App and avoid use of the *.chinacloudsites.cn host name towards the backend. Note that a different frontend domain isn't a problem in all situations, and certain categories of backends, like REST APIs, are less sensitive in general.

Potential benefits: Ensure site integrity and avoid broken cookies or redirect URLs through a resilient Application Gateway configuration.

For more information, see Troubleshoot App Service issues in Application Gateway

Implement ExpressRoute Monitor on Network Performance Monitor

When an ExpressRoute circuit isn't monitored by ExpressRoute Monitor on Network Performance Monitor, you miss notifications of loss, latency, and performance of on-premises to Azure resources, and Azure to on-premises resources. For end-to-end monitoring, implement ExpressRoute Monitor on Network Performance Monitor.

Potential benefits: Improve time-to-detect and time-to-mitigate issues in your network and provide insights on your network path via ExpressRoute

For more information, see Configure Network Performance Monitor for ExpressRoute (deprecated)

Implement multiple ExpressRoute circuits in your Virtual Network for cross premises resiliency

When an ExpressRoute gateway only has one ExpressRoute circuit associated to it, resiliency issues might occur. To ensure peering location redundancy and resiliency, connect one or more additional circuits to your gateway.

Potential benefits: Improve resiliency in case of ExpressRoute peering location failure

For more information, see Designing for high availability with ExpressRoute

Add at least one more endpoint to the profile, preferably in another Azure region

Profiles need more than one endpoint to ensure availability if one of the endpoints fails. We also recommend that endpoints be in different regions.

Potential benefits: Improve resiliency by allowing failover

For more information, see Traffic Manager endpoints

Add an endpoint configured to "All (World)"

For geographic routing, traffic is routed to endpoints in defined regions. When a region fails, there is no predefined failover. Having an endpoint where the Regional Grouping is configured to "All (World)" for geographic profiles avoids traffic black holing and guarantees service availability.

Potential benefits: Improve resiliency by avoiding traffic black holes

For more information, see Add, disable, enable, delete, or move endpoints
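
To see why a catch-all endpoint matters, consider a deliberately simplified model of geographic routing in which each region maps to one endpoint. The Python sketch below is not Traffic Manager's actual resolution logic; the region names, endpoint names, and the resolve_endpoint helper are all hypothetical.

```python
from typing import Dict, Optional

WORLD = "All (World)"  # catch-all regional grouping named in the recommendation


def resolve_endpoint(region: str, geo_map: Dict[str, str]) -> Optional[str]:
    """Pick an endpoint for a request's region in a simplified geographic profile.

    Returns None when no mapping applies, which models black-holed traffic.
    """
    if region in geo_map:
        return geo_map[region]
    # No region-specific mapping: use the catch-all endpoint if one exists.
    return geo_map.get(WORLD)


# Hypothetical profiles: one without and one with an "All (World)" endpoint.
without_fallback = {"Europe": "eu-endpoint", "Asia": "asia-endpoint"}
with_fallback = {**without_fallback, WORLD: "global-endpoint"}

print(resolve_endpoint("Africa", without_fallback))  # None
print(resolve_endpoint("Africa", with_fallback))     # global-endpoint
print(resolve_endpoint("Europe", with_fallback))     # eu-endpoint
```

In this model, a request from a region with no mapping gets no answer unless the profile includes the catch-all entry.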

Add or move one endpoint to another Azure region

All endpoints associated to this proximity profile are in the same region. Users from other regions may experience long latency when attempting to connect. Adding or moving an endpoint to another region will improve overall performance for proximity routing and provide better availability if all endpoints in one region fail.

Potential benefits: Improve resiliency by allowing failover to another region

For more information, see Configure the performance traffic routing method

Move to production gateway SKUs from Basic gateways

The Basic VPN SKU is for development or testing scenarios. If you're using the VPN gateway for production, move to a production SKU, which offers higher numbers of tunnels, Border Gateway Protocol (BGP), active-active configuration, custom IPsec/IKE policy, and increased stability and availability.

Potential benefits: Additional available features and higher stability and availability

For more information, see About VPN Gateway configuration settings

Enable Active-Active gateways for redundancy

In active-active configuration, both instances of the VPN gateway establish site-to-site (S2S) VPN tunnels to your on-premises VPN device. When planned maintenance or an unplanned event affects one gateway instance, traffic is automatically switched over to the other active IPsec tunnel.

Potential benefits: Ensure business continuity through connection resilience

For more information, see Design highly available gateway connectivity for cross-premises and VNet-to-VNet connections

Disable health probes when there is only one origin in an origin group

If you only have a single origin, Front Door always routes traffic to that origin even if its health probe reports an unhealthy status. The status of the health probe doesn't do anything to change Front Door's behavior. In this scenario, health probes don't provide a benefit.

Potential benefits: Ensure service availability by reducing unnecessary health probe traffic

For more information, see Best practices for Front Door

Use managed TLS certificates

When Front Door manages your TLS certificates, it reduces your operational costs, and helps you to avoid costly outages caused by forgetting to renew a certificate. Front Door automatically issues and rotates the managed TLS certificates.

Potential benefits: Ensure service availability by having Front Door manage and rotate your certificates

For more information, see Best practices for Front Door

Use NAT gateway for outbound connectivity

Prevent connectivity failures due to source network address translation (SNAT) port exhaustion by using NAT gateway for outbound traffic from your virtual networks. NAT gateway scales dynamically and provides secure connections for traffic headed to the internet.

Potential benefits: Prevent outbound connection failures with NAT gateway

For more information, see Use Source Network Address Translation (SNAT) for outbound connections

Deploy your Application Gateway across Availability Zones

Achieve zone redundancy by deploying Application Gateway across Availability Zones. Zone redundancy boosts resilience by enabling Application Gateway to survive various outages, which ensures continuity even if one zone is affected, and enhances overall reliability.

Potential benefits: Resiliency of Application Gateways is considerably increased when using Availability Zones.

For more information, see Scaling Application Gateway v2 and WAF v2

Update VNet permission of Application Gateway users

To improve security and provide a more consistent experience across Azure, all users must pass a permission check to create or update an Application Gateway in a Virtual Network. The minimum permission required for users or service principals is Microsoft.Network/virtualNetworks/subnets/join/action.

Potential benefits: Avoid disruptions in management of Application Gateway resource

For more information, see Application Gateway infrastructure configuration

Use the same domain name on Front Door and your origin

When you rewrite the Host header, request cookies and URL redirections might break. When you use platforms like Azure App Service, features like session affinity and authentication and authorization might not work correctly. Make sure to validate whether your application is going to work correctly.

Potential benefits: Ensure application integrity by preserving original host name

For more information, see Best practices for Front Door

Implement Site Resiliency for ExpressRoute

To ensure maximum resiliency, Microsoft recommends that you connect to two ExpressRoute circuits in two peering locations. The goal of Maximum Resiliency is to enhance availability and ensure the highest level of resilience for critical workloads.

Potential benefits: Maximum Resiliency in ExpressRoute is designed to ensure there isn’t a single point of failure within the Microsoft network path. This is achieved by offering dual (2) circuits across two different locations for site diversity in ExpressRoute.

For more information, see Design and architect Azure ExpressRoute for resiliency

Implement Zone Redundant ExpressRoute Gateways

Implement zone-redundant Virtual Network Gateway in Azure Availability Zones. This brings resiliency, scalability, and higher availability to your Virtual Network Gateways.

Potential benefits: Provides zonal resiliency and redundancy for ExpressRoute

For more information, see Create a zone-redundant virtual network gateway in availability zones

Ensure autoscaling is used for increased performance and resiliency

When configuring the Application Gateway, it's recommended to provision autoscaling to scale in and out in response to changes in demand. This helps to minimize the effects of a single failing component.

Potential benefits: Increase performance and resiliency.

For more information, see Scaling Application Gateway v2 and WAF v2

ExpressRoute IP routes nearing specified limit

Your ExpressRoute circuit is close to reaching its IP route limits. Exceeding these limits disrupts connectivity, which is restored once routes are back within limits. Suggestions: Regularly monitor route counts, and explore Virtual WAN RouteMap to reduce advertised IP routes.

Potential benefits: Monitoring IP route counts prevents connectivity issues and ensures stability.

For more information, see Virtual WAN FAQ

Change subnet of V1 gateway named GatewaySubnet as it's reserved for VPN/ExpressRoute

Your Application Gateway is at risk of deletion after October 2024 due to a failed internal upgrade. The upgrade failed because the gateway's subnet is named GatewaySubnet, which is reserved for VPN/ExpressRoute. To resolve this, change the subnet or migrate to V2. Allow a day for the message to disappear once fixed.

Potential benefits: Avoid disruption in management of Application Gateway V1 resource

For more information, see Frequently asked questions about Application Gateway

Change subnet of V1 gateway as the current subnet contains a NAT gateway

Your Application Gateway may be deleted after October 2024 due to a failed internal upgrade. This is because the gateway lacks a dedicated subnet and its current subnet contains a NAT gateway. To resolve this, change the subnet, remove the NAT gateway, or migrate to V2. Allow a day for the message to disappear once fixed.

Potential benefits: Avoid disruption in management of Application Gateway V1 resource

For more information, see Frequently asked questions about Application Gateway

Reactivate the Subscription to unblock internal upgrade for V1 gateway

Your Application Gateway is at risk of deletion after October 2024 due to a failed internal upgrade. This is because the subscription is in a non-active state. To fix this, reactivate the subscription. Allow a day for this message to disappear once the issue is fixed.

Potential benefits: Avoid disruption in management of Application Gateway V1 resource

For more information, see Reactivate a disabled Azure subscription

Application Gateway for Containers

Migrate to supported version of AGC

Your Application Gateway for Containers was provisioned with a preview version and isn't supported for production. Provision a new gateway using the latest API version.

Potential benefits: Ensure supportability and resiliency for production workloads

For more information, see What is Application Gateway for Containers?

Create a Standard search service (2GB)

When you exceed your storage quota, indexing operations stop working. You're close to exceeding your storage quota of 2 GB. If you need more storage, create a Standard search service or add extra partitions.

Potential benefits: Capability to handle more data

For more information, see https://aka.ms/azs/search-limits-quotas-capacity

Create a Standard search service (50MB)

When you exceed your storage quota, indexing operations stop working. You're close to exceeding your storage quota of 50 MB. To maintain operations, create a Basic or Standard search service.

Potential benefits: Capability to handle more data

For more information, see https://aka.ms/azs/search-limits-quotas-capacity

Avoid exceeding your available storage quota by adding more partitions

When you exceed your storage quota, you can still query, but indexing operations stop working. You're close to exceeding your available storage quota. If you need more storage, add extra partitions.

Potential benefits: Able to index additional data

For more information, see https://aka.ms/azs/search-limits-quotas-capacity

Azure Arc-enabled Kubernetes

Upgrade to the latest agent version of Azure Arc-enabled Kubernetes

For the best Azure Arc-enabled Kubernetes experience, improved stability, and new functionality, upgrade to the latest agent version.

Potential benefits: Arc-enabled K8s latest agent version

For more information, see Upgrade Azure Arc-enabled Kubernetes agents

Azure Arc-enabled Kubernetes Configuration

Upgrade Microsoft Flux extension to the newest major version

The Microsoft Flux extension has a major version release. Plan for a manual upgrade to the latest major version for Microsoft Flux for all Azure Arc-enabled Kubernetes and Azure Kubernetes Service (AKS) clusters within 6 months for continued support and new functionality.

Potential benefits: Continued support and new functionality

For more information, see Available extensions for Azure Arc-enabled Kubernetes clusters

Upcoming Breaking Changes for Microsoft Flux Extension

The Microsoft Flux extension frequently receives updates for security and stability. The upcoming update, in line with the OSS Flux Project, will modify the HelmRelease and HelmChart APIs by removing deprecated fields. To avoid disruption to your workloads, necessary action is needed.

Potential benefits: Improved stability, security, and new functionality

For more information, see Available extensions for Azure Arc-enabled Kubernetes clusters

Upgrade Microsoft Flux extension to a supported version

The current version of Microsoft Flux on one or more Azure Arc-enabled clusters and Azure Kubernetes Service (AKS) clusters is out of support. To get security patches, bug fixes, and Microsoft support, upgrade to a supported version.

Potential benefits: Get security patches, bug fixes, and Microsoft support

For more information, see Available extensions for Azure Arc-enabled Kubernetes clusters

Azure Arc-enabled servers

Upgrade to the latest version of the Azure Connected Machine agent

The Azure Connected Machine agent is updated regularly with bug fixes, stability enhancements, and new functionality. For the best Azure Arc experience, upgrade your agent to the latest version.

Potential benefits: Improved stability and new functionality

For more information, see Managing and maintaining the Connected Machine agent

Azure Cache for Redis

Increase fragmentation memory reservation

Fragmentation and memory pressure can cause availability incidents. To help reduce cache failures when running under high memory pressure, increase the memory reserved for fragmentation through the maxfragmentationmemory-reserved setting available in the Advanced Settings options.

Potential benefits: Avoid availability incidents when your cache has high memory fragmentation

For more information, see How to configure Azure Cache for Redis

Configure geo-replication for Cache for Redis instances to increase durability of applications

Geo-Replication enables disaster recovery for cached data, even in the unlikely event of a widespread regional failure. This can be essential for mission-critical applications. We recommend that you configure passive geo-replication for Premium Azure Cache for Redis instances.

Potential benefits: Geo-Replication enables disaster recovery for cached data.

For more information, see Configure passive geo-replication for Premium Azure Cache for Redis instances

Azure Container Apps

Re-create your Container Apps environment to avoid DNS issues

There's a potential networking issue with your Container Apps environments that might cause DNS issues. We recommend that you create a new Container Apps environment, re-create your Container Apps in the new environment, and delete the old Container Apps environment.

Potential benefits: Avoid DNS failures in your Container Apps Environment.

For more information, see Quickstart: Deploy your first container app using the Azure portal

Renew custom domain certificate

The custom domain certificate you uploaded is near expiration. To prevent possible service downtime, renew your certificate and upload the new certificate for your container apps.

Potential benefits: Your service won't fail because of an expired certificate.

For more information, see Custom domain names and bring your own certificates in Azure Container Apps
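
A "near expiration" check like the one behind this recommendation can be sketched as a date comparison. The Python example below is a minimal illustration: the 30-day renewal window and the helper names are assumptions, and the expiry dates are made up rather than read from a real certificate.

```python
from datetime import datetime, timezone

RENEWAL_WINDOW_DAYS = 30  # assumed threshold for "near expiration"


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the certificate's expiry timestamp."""
    return (not_after - now).days


def needs_renewal(not_after: datetime, now: datetime,
                  window_days: int = RENEWAL_WINDOW_DAYS) -> bool:
    """True when the certificate expires within the renewal window."""
    return days_until_expiry(not_after, now) <= window_days


# Fixed, hypothetical dates so the result doesn't depend on the current time.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(needs_renewal(datetime(2024, 1, 21, tzinfo=timezone.utc), now))  # True
print(needs_renewal(datetime(2024, 6, 1, tzinfo=timezone.utc), now))   # False
```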

An issue has been detected that is preventing the renewal of your Managed Certificate.

We detected that the managed certificate used by the Container App failed to auto-renew. Follow the documentation link to make sure that the DNS settings of your custom domain are correct.

Potential benefits: Avoid downtime due to an expired certificate.

For more information, see Custom domain names and free managed certificates in Azure Container Apps

Increase the minimal replica count for your containerized application

The minimal replica count set for your Azure Container App containerized application might be too low, which can cause resilience, scalability, and load balancing issues. For better availability, consider increasing the minimal replica count.

Potential benefits: Better availability for your container app.

For more information, see Set scaling rules in Azure Container Apps

Azure Cosmos DB

Configure Azure Cosmos DB containers with a partition key

When Azure Cosmos DB nonpartitioned collections reach their provisioned storage quota, you lose the ability to add data. Your Cosmos DB nonpartitioned collections are approaching their provisioned storage quota. Migrate these collections to new collections with a partition key definition so they can automatically be scaled out by the service.

Potential benefits: Scale your containers seamlessly with increase in storage or request rates without running into any limits

For more information, see Partitioning and horizontal scaling in Azure Cosmos DB

Use static Cosmos DB client instances in your code and cache the names of databases and collections

A high number of metadata operations on an account can result in rate limiting. Metadata operations have a system-reserved request unit (RU) limit. Avoid rate limiting from metadata operations by using static Cosmos DB client instances in your code and caching the names of databases and collections.

Potential benefits: Optimize your RU usage and avoid rate limiting

For more information, see Performance tips for Azure Cosmos DB and .NET SDK v2
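
The pattern this recommendation describes, one shared client plus cached names, can be sketched without any SDK. In the Python example below, FakeCosmosClient is a hypothetical stand-in that only counts metadata lookups. It is not the Azure Cosmos DB SDK, and its method name is invented for illustration.

```python
import functools


class FakeCosmosClient:
    """Hypothetical stand-in for an SDK client; counts metadata lookups."""

    def __init__(self) -> None:
        self.metadata_calls = 0

    def read_container_name(self, alias: str) -> str:
        # Each call here represents a metadata operation against the account.
        self.metadata_calls += 1
        return f"container-{alias}"


@functools.lru_cache(maxsize=1)
def get_client() -> FakeCosmosClient:
    """Create the client once and reuse the same instance everywhere."""
    return FakeCosmosClient()


@functools.lru_cache(maxsize=None)
def container_name(alias: str) -> str:
    """Look up a container name once, then serve it from the cache."""
    return get_client().read_container_name(alias)


# 1,000 requests for the same name trigger a single metadata lookup.
for _ in range(1000):
    container_name("orders")
print(get_client().metadata_calls)  # 1
```

Without the two caches, every request would construct a client and repeat the lookup, which is the kind of metadata traffic the recommendation warns can be rate limited.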

Check linked Azure Key Vault hosting your encryption key

When an Azure Cosmos DB account can't access its linked Azure Key Vault hosting the encryption key, data access and security issues might happen. Your Azure Key Vault's configuration is preventing your Cosmos DB account from contacting the key vault to access your managed encryption keys. If you recently performed a key rotation, ensure that the previous key, or key version, remains enabled and available until Cosmos DB completes the rotation. The previous key or key version can be disabled after 24 hours, or after the Azure Key Vault audit logs don't show any activity from Azure Cosmos DB on that key or key version.

Potential benefits: Update your configurations to continue using customer-managed keys and access your data

For more information, see Configure customer-managed keys for your Azure Cosmos DB account with Azure Key Vault

Configure consistent indexing mode on Azure Cosmos DB containers

Azure Cosmos containers configured with the Lazy indexing mode update asynchronously, which improves write performance, but can impact query freshness. Your container is configured with the Lazy indexing mode. If query freshness is critical, use Consistent Indexing Mode for immediate index updates.

Potential benefits: Improve query result consistency and reliability

For more information, see Manage indexing policies in Azure Cosmos DB

Hotfix - Upgrade to 2.6.14 version of the Async Java SDK v2 or to Java SDK v4

There's a critical bug in version 2.6.13 (and lower) of the Azure Cosmos DB Async Java SDK v2 causing errors when a Global logical sequence number (LSN) greater than the Max Integer value is reached. The error is triggered on the service side, transparently to you, after a large volume of transactions occurs in the lifetime of an Azure Cosmos DB container. Note: While this is a critical hotfix for the Async Java SDK v2, we still highly recommend you migrate to the Java SDK v4.

Potential benefits: If action isn’t taken, all create, read, update, and delete operations may begin to fail with NumberFormatException

For more information, see Azure Cosmos DB Async Java SDK for API for NoSQL (legacy): Release notes and resources

There's a critical bug in version 4.15 and lower of the Azure Cosmos DB Java SDK v4 that causes errors when a global logical sequence number (LSN) greater than the maximum integer value is reached. The service raises the error, without any action on your part, after a large volume of transactions occurs over the lifetime of an Azure Cosmos DB container. Avoid this problem by upgrading to the current recommended version of the Java SDK v4.

Potential benefits: If action isn’t taken, all create, read, update, and delete operations may begin to fail with NumberFormatException

For More information, see Azure Cosmos DB Java SDK v4 for API for NoSQL: release notes and resources
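
To illustrate the failure mode, the sketch below mimics in Python what happens when a value is parsed as a 32-bit signed integer, whose maximum is 2,147,483,647 (Java's `Integer.MAX_VALUE`). This is not the SDK's code: `parse_int32` is a hypothetical stand-in, and Python's `ValueError` plays the role of Java's `NumberFormatException`.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest 32-bit signed integer
INT32_MIN = -(2**31)


def parse_int32(text: str) -> int:
    """Parse text as a 32-bit signed integer, failing when it is out of range."""
    value = int(text)
    if not INT32_MIN <= value <= INT32_MAX:
        raise ValueError(f"For input string: {text!r} (out of 32-bit range)")
    return value


print(parse_int32("2147483647"))  # the maximum still fits
try:
    parse_int32("2147483648")  # one past the maximum
except ValueError as err:
    print("parse failed:", err)
```

Because the LSN only grows, a container that crosses this threshold keeps failing until the client is upgraded.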

Use the new 3.6+ endpoint to connect to your upgraded Azure Cosmos DB's API for MongoDB account

Some of your applications are connecting to your upgraded Azure Cosmos DB's API for MongoDB account using the legacy 3.2 endpoint - [accountname].documents.azure.com. Use the new endpoint - [accountname].mongo.cosmos.azure.com (or its equivalent in sovereign, government, or restricted clouds).

Potential benefits: Take advantage of the latest features in version 3.6+ of Azure Cosmos DB's API for MongoDB

For More information, see Azure Cosmos DB for MongoDB (4.0 server version): supported features and syntax
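
The host name change can be sketched as a small Python helper. The new suffix comes from the recommendation text; the default legacy suffix is an assumption for the global Azure cloud, and `to_new_endpoint` is a hypothetical name. Pass the suffixes that match your cloud if they differ.

```python
LEGACY_SUFFIX = ".documents.azure.com"  # assumed legacy 3.2 suffix (global Azure cloud)
NEW_SUFFIX = ".mongo.cosmos.azure.com"  # 3.6+ suffix named in the recommendation


def to_new_endpoint(host: str,
                    legacy_suffix: str = LEGACY_SUFFIX,
                    new_suffix: str = NEW_SUFFIX) -> str:
    """Return the 3.6+ host for a legacy host; leave any other host unchanged."""
    if host.lower().endswith(legacy_suffix):
        account = host[: -len(legacy_suffix)]
        return account + new_suffix
    return host


print(to_new_endpoint("contoso.documents.azure.com"))  # contoso.mongo.cosmos.azure.com
```

Only the host name is rewritten here. For a real migration, it's safer to copy the full connection string for your account than to rely on a string rewrite.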

Upgrade your Azure Cosmos DB API for MongoDB account to v4.2 to save on query/storage costs and utilize new features

Your Azure Cosmos DB API for MongoDB account is eligible to upgrade to version 4.2. Upgrading to v4.2 can reduce your storage costs by up to 55% and your query costs by up to 45% by leveraging a new storage format. Numerous additional features such as multi-document transactions are also included in v4.2.

Potential benefits: Improved reliability, query/storage efficiency, performance, and new feature capabilities

For More information, see Upgrade the API version of your Azure Cosmos DB for MongoDB account

Enable Server Side Retry (SSR) on your Azure Cosmos DB's API for MongoDB account

When an account is throwing a TooManyRequests error with the 16500 error code, enabling Server Side Retry (SSR) can help mitigate the issue.

Potential benefits: Prevent throttling and improve your query reliability and performance

Add a second region to your production workloads on Azure Cosmos DB

Production workloads on Azure Cosmos DB that run in a single region might have availability issues, and this appears to be the case with some of your Cosmos DB accounts. Increase their availability by configuring them to span at least two Azure regions. NOTE: Additional regions incur additional costs.

Potential benefits: Improve the availability of your production workloads

For More information, see High availability (Reliability) in Azure Cosmos DB for NoSQL

Upgrade old Azure Cosmos DB SDK to the latest version

An Azure Cosmos DB account using an old version of the SDK lacks the latest fixes and improvements. Your Azure Cosmos DB account is using an old version of the SDK. For the latest fixes, performance improvements, and new feature capabilities, upgrade to the latest version.

Potential benefits: Improved reliability, performance, and new feature capabilities

For More information, see Azure Cosmos DB documentation

Upgrade outdated Azure Cosmos DB SDK to the latest version

An Azure Cosmos DB account using an old version of the SDK lacks the latest fixes and improvements. Your Azure Cosmos DB account is using an outdated version of the SDK. We recommend upgrading to the latest version for the latest fixes, performance improvements, and new feature capabilities.

Potential benefits: Improved reliability, performance, and new feature capabilities

For More information, see Azure Cosmos DB documentation

Enable service managed failover for Cosmos DB account

Enable service managed failover for Cosmos DB account to ensure high availability of the account. Service managed failover automatically switches the write region to the secondary region in case of a primary region outage. This ensures that the application continues to function without any downtime.

Potential benefits: Azure's Service-Managed Failover feature enhances system availability by automating failover processes, reducing downtime, and improving resilience.

For More information, see High availability (Reliability) in Azure Cosmos DB for NoSQL

Enable HA for your Production workload

Many clusters with consistent workloads don't have high availability (HA) enabled. We recommend activating HA from the Scale page in the Azure portal to prevent database downtime in case of unexpected node failures and to qualify for SLA guarantees.

Potential benefits: Activate HA to avoid database downtime in case of an unexpected node failure

For More information, see Scaling and configuring Your Azure Cosmos DB for MongoDB vCore cluster

Enable zone redundancy for multi-region Cosmos DB accounts

This recommendation suggests enabling zone redundancy for multi-region Cosmos DB accounts to improve high availability and reduce the risk of data loss in case of a regional outage.

Potential benefits: Improved high availability and reduced risk of data loss

For More information, see High availability (Reliability) in Azure Cosmos DB for NoSQL

Avoid being rate limited for Control Plane operation

We found a high number of control plane operations on your account through the resource provider. Requests that exceed the documented limits at sustained levels over consecutive 5-minute periods might be throttled, and operations on Azure Cosmos DB resources might fail or remain incomplete.

Potential benefits: Optimize control plane operation and avoid operation failure due to rate limiting

For More information, see Azure Cosmos DB service quotas
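
One way to spot this pattern in your own automation is to bucket control plane calls into 5-minute windows and look for consecutive windows above a threshold. The sketch below assumes you can collect call timestamps in seconds; the function name, the threshold, and the choice of two consecutive windows are hypothetical, and the real limits are in the service quotas documentation.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60


def sustained_overage(timestamps, limit_per_window, consecutive=2):
    """Return True if `consecutive` adjacent 5-minute windows each exceed the limit."""
    counts = Counter(int(ts // WINDOW_SECONDS) for ts in timestamps)
    run = 0
    previous = None
    for window in sorted(counts):
        over = counts[window] > limit_per_window
        adjacent = previous is not None and window == previous + 1
        if over:
            # Extend the run only when the previous window was also over the limit.
            run = run + 1 if adjacent and run > 0 else 1
            if run >= consecutive:
                return True
        else:
            run = 0
        previous = window
    return False


# Three calls in each of two adjacent windows, with a hypothetical limit of 2.
print(sustained_overage([0, 10, 20, 300, 310, 320], limit_per_window=2))  # True
```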

Azure Data Explorer

Resolve virtual network issues

Service failed to install or resume due to virtual network (VNet) issues. To resolve this issue, follow the steps in the troubleshooting guide.

Potential benefits: Improve reliability, availability, performance, and new feature capabilities

For More information, see Troubleshoot access, ingestion, and operation of your Azure Data Explorer cluster in your virtual network

Add subnet delegation for 'Microsoft.Kusto/clusters'

If a subnet isn’t delegated, the associated Azure service won’t be able to operate within it. Your subnet doesn’t have the required delegation. Delegate your subnet for 'Microsoft.Kusto/clusters'.

Potential benefits: Improve reliability, availability, performance, and new feature capabilities

For More information, see What is subnet delegation?
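
As a rough check, the sketch below inspects a subnet definition held as a Python dictionary and reports whether it's delegated to 'Microsoft.Kusto/clusters'. The nested `properties`, `delegations`, and `serviceName` layout is assumed to follow the JSON shape of a subnet resource; `is_delegated` and the sample subnet are hypothetical.

```python
REQUIRED_SERVICE = "Microsoft.Kusto/clusters"


def is_delegated(subnet: dict, service: str = REQUIRED_SERVICE) -> bool:
    """Return True if the subnet lists a delegation for the given service."""
    delegations = subnet.get("properties", {}).get("delegations", [])
    return any(
        d.get("properties", {}).get("serviceName", "").lower() == service.lower()
        for d in delegations
    )


# Hypothetical subnet definition for illustration only.
subnet = {
    "name": "adx-subnet",
    "properties": {
        "delegations": [
            {"name": "kusto", "properties": {"serviceName": "Microsoft.Kusto/clusters"}}
        ]
    },
}
print(is_delegated(subnet))  # True
```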

Azure Database for MySQL

High Availability - Add primary key to the table that currently doesn't have one

Our internal monitoring system has identified significant replication lag on the High Availability standby server. This lag is primarily caused by the standby server replaying relay logs on a table that lacks a primary key. To address this issue and adhere to best practices, it's recommended to add primary keys to all tables. Once this is done, proceed to disable and then re-enable High Availability to mitigate the problem.

Potential benefits: By implementing this approach, the standby server will be shielded from the adverse effects of high replication lag caused by the absence of a primary key on any table. This approach can contribute to reduced failover times, ultimately supporting the goal of maintaining business continuity.

For More information, see Troubleshoot replication latency in Azure Database for MySQL - Flexible Server

Replication - Add a primary key to the table that currently doesn't have one

Our internal monitoring observed significant replication lag on your replica server because the replica server is replaying relay logs on a table that lacks a primary key. To ensure that the replica server can effectively synchronize with the primary and keep up with changes, add primary keys to the tables in the primary server and then recreate the replica server.

Potential benefits: By implementing this approach, the replica server will achieve a state of close synchronization with the primary server.

For More information, see Troubleshoot replication latency in Azure Database for MySQL - Flexible Server
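
To find the tables that need a primary key, you can query `information_schema`. The sketch below wraps such a query in a Python function that accepts any DB-API cursor, so it doesn't depend on a particular MySQL driver. The function name and the list of excluded system schemas are assumptions to adjust for your server.

```python
FIND_TABLES_WITHOUT_PK = """
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
"""


def tables_without_primary_key(cursor):
    """Run the query on a DB-API cursor and return 'schema.table' names."""
    cursor.execute(FIND_TABLES_WITHOUT_PK)
    return [f"{schema}.{table}" for schema, table in cursor.fetchall()]
```

Pass a cursor from the driver you already use. Run the check on the primary server, because that's where the primary keys need to be added.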

Azure Database for PostgreSQL

Remove inactive logical replication slots (important)

Inactive logical replication slots can result in degraded server performance and unavailability due to write ahead log (WAL) file retention and buildup of snapshot files. Your Azure Database for PostgreSQL flexible server might have inactive logical replication slots. THIS NEEDS IMMEDIATE ATTENTION. Either delete the inactive replication slots, or start consuming the changes from these slots, so that the slots' Log Sequence Number (LSN) advances and is close to the current LSN of the server.

Potential benefits: Improve PostgreSQL availability by removing inactive logical replication slots

For More information, see Logical replication and logical decoding in Azure Database for PostgreSQL - Flexible Server

Remove inactive logical replication slots

When an Azure Database for PostgreSQL flexible server has inactive logical replication slots, degraded server performance and unavailability due to write ahead log (WAL) file retention and buildup of snapshot files might occur. THIS NEEDS IMMEDIATE ATTENTION. Either delete the inactive replication slots, or start consuming the changes from these slots, so that the slots' Log Sequence Number (LSN) advances and is close to the current LSN of the server.

Potential benefits: Improve PostgreSQL availability by removing inactive logical replication slots

For More information, see Logical decoding
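
The two recommendations above come down to listing logical slots that aren't active and then either consuming from them or dropping them. A minimal sketch, assuming a DB-API cursor whose driver uses the `%s` parameter style (as psycopg2 does); the function names are hypothetical, while `pg_replication_slots` and `pg_drop_replication_slot` are the standard PostgreSQL view and function.

```python
FIND_INACTIVE_LOGICAL_SLOTS = """
SELECT slot_name
FROM pg_replication_slots
WHERE slot_type = 'logical' AND NOT active
"""

# Assumes a driver with the %s parameter style; adjust for other drivers.
DROP_SLOT = "SELECT pg_drop_replication_slot(%s)"


def inactive_logical_slots(cursor):
    """Return the names of logical replication slots that have no active consumer."""
    cursor.execute(FIND_INACTIVE_LOGICAL_SLOTS)
    return [row[0] for row in cursor.fetchall()]


def drop_slots(cursor, slot_names):
    """Drop each named slot; this is irreversible, so review the list first."""
    for name in slot_names:
        cursor.execute(DROP_SLOT, (name,))
```

Listing and dropping are kept as separate steps so that the slot names can be reviewed before anything is deleted.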

Configure geo redundant backup storage

Configure geo-redundant storage (GRS) to ensure that your database meets its availability and durability targets even in the face of failures or disasters.

Potential benefits: Ensures recovery from regional failure or disaster.

For More information, see Backup and restore in Azure Database for PostgreSQL - Flexible Server

Define custom maintenance windows to occur during low-peak hours

When specifying preferences for the maintenance schedule, you can pick a day of the week and a time window. If you don't specify, the system will pick times between 11pm and 7am in your server's region time. Pick a day and time where usage is low.

Potential benefits: Configuring a maintenance window helps you avoid maintenance during peak system usage.

For More information, see Scheduled maintenance in Azure Database for PostgreSQL - Flexible Server

Azure IoT Hub

Upgrade Azure IoT Edge device runtime to a supported version for IoT Hub

When Edge devices use outdated versions, performance degradation might occur. We recommend you upgrade to the latest supported version of the Azure IoT Edge runtime.

Potential benefits: Ensure business continuity with latest supported version for your Edge devices

For More information, see Update IoT Edge

Upgrade device client SDK to a supported version for IoT Hub

When devices use an outdated SDK, performance degradation can occur. Some or all of your devices are using an outdated SDK. We recommend you upgrade to a supported SDK version.

Potential benefits: Ensure business continuity with supported SDK for your devices

For More information, see Azure IoT Hub SDKs

IoT Hub Potential Device Storm Detected

A device storm occurs when two or more devices try to connect to the IoT Hub using the same device ID credentials. When the second device (B) connects, it causes the first one (A) to become disconnected. Then (A) attempts to reconnect, which causes (B) to get disconnected.

Potential benefits: Improve connectivity of your devices

For More information, see Understand and resolve Azure IoT Hub errors
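
The back-and-forth described above can be shown with a toy simulation. This Python sketch doesn't use any IoT Hub SDK; it only models the rule that a new connection with the same device ID evicts the current one. The function name, the client labels, and the device ID are hypothetical.

```python
def simulate_storm(device_id: str, clients: list, attempts: int) -> list:
    """Simulate clients sharing one device ID; each connect evicts the current holder."""
    events = []
    connected = None
    for i in range(attempts):
        client = clients[i % len(clients)]  # evicted clients keep retrying in turn
        if connected is not None and connected != client:
            events.append(f"{connected} disconnected")
        connected = client
        events.append(f"{client} connected as {device_id}")
    return events


for line in simulate_storm("sensor-01", ["A", "B"], 4):
    print(line)
```

Every connection attempt after the first one produces a disconnect, which is why the loop never settles until each device has its own identity.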

Add IoT Hub units or increase SKU level

When an IoT Hub exceeds its daily message quota, operation and cost problems might occur. To ensure smooth operation in the future, add units or increase the SKU level.

Potential benefits: The IoT Hub can receive messages again.

For More information, see Understand and resolve Azure IoT Hub errors
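
To estimate how many units cover a given load, divide the expected daily message count by the per-unit daily quota and round up. The per-unit quotas in this sketch are assumptions for the standard tiers and should be confirmed against the current IoT Hub quota documentation; `units_needed` is a hypothetical helper.

```python
import math

# Assumed daily message quotas per unit; confirm against current IoT Hub documentation.
DAILY_QUOTA_PER_UNIT = {"S1": 400_000, "S2": 6_000_000, "S3": 300_000_000}


def units_needed(daily_messages: int, sku: str) -> int:
    """Return the smallest number of units whose combined quota covers the load."""
    return max(1, math.ceil(daily_messages / DAILY_QUOTA_PER_UNIT[sku]))


print(units_needed(1_000_000, "S1"))  # 3
print(units_needed(1_000_000, "S2"))  # 1
```

Comparing the result across SKUs shows whether adding units or moving up a tier is the smaller change.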

Azure Kubernetes Service (AKS)

Enable Autoscaling for your system node pools

To ensure your system pods are scheduled even during times of high load, enable autoscaling on your system node pool.

Potential benefits: Enabling Autoscaler for system node pool ensures system pods are scheduled and cluster can function.

For More information, see Use the cluster autoscaler in Azure Kubernetes Service (AKS)

Have at least 2 nodes in your system node pool

Ensure your system node pools have at least 2 nodes for reliability of your system pods. With a single node, your cluster can fail in the event of a node or hardware failure.

Potential benefits: Having 2 nodes ensures resiliency against node failures.

For More information, see Manage system node pools in Azure Kubernetes Service (AKS)

Create a dedicated system node pool

A cluster without a dedicated system node pool is less reliable. We recommend you dedicate system node pools to only serve critical system pods, preventing resource starvation between system and competing user pods. Enforce this behavior with the CriticalAddonsOnly=true:NoSchedule taint on the pool.

Potential benefits: Ensures cluster reliability by preventing resource scarcity for core system pods

For More information, see Manage system node pools in Azure Kubernetes Service (AKS)

Ensure B-series Virtual Machines (VMs) aren't used in production environments

When a cluster has one or more node pools using a non-recommended burstable VM SKU, full vCPU capability (100%) isn't guaranteed. Ensure B-series VMs aren't used in production environments.

Potential benefits: Best practice for consistent performance

For More information, see Bv1 sizes series
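
The four AKS checks in this section can be combined into one audit of a system node pool. The sketch below works on a plain dictionary; the field names (`enable_auto_scaling`, `count`, `node_taints`, `vm_size`) are assumptions about how you export node pool settings, and the `Standard_B` prefix test is a heuristic for burstable sizes.

```python
REQUIRED_TAINT = "CriticalAddonsOnly=true:NoSchedule"


def audit_system_pool(pool: dict) -> list:
    """Return a list of findings for one system node pool; empty means no findings."""
    findings = []
    if not pool.get("enable_auto_scaling", False):
        findings.append("autoscaling is disabled")
    if pool.get("count", 0) < 2:
        findings.append("fewer than 2 nodes")
    if REQUIRED_TAINT not in pool.get("node_taints", []):
        findings.append("missing CriticalAddonsOnly taint")
    if pool.get("vm_size", "").lower().startswith("standard_b"):
        findings.append("burstable B-series VM size")
    return findings


# Hypothetical node pool settings for illustration only.
pool = {"enable_auto_scaling": False, "count": 1, "node_taints": [], "vm_size": "Standard_B2s"}
for finding in audit_system_pool(pool):
    print(finding)
```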

Azure NetApp Files

Configure AD DS Site for Azure NetApp Files AD Connector

If Azure NetApp Files can't reach assigned AD DS site domain controllers, the domain controller discovery process queries all domain controllers. Unreachable domain controllers may be used, causing issues with volume creation, client queries, authentication, and AD connection modifications.

Potential benefits: Optimize DNS Connectivity with Azure NetApp Files

For More information, see Understand guidelines for Active Directory Domain Services site design and planning for Azure NetApp Files

Ensure Roles assigned to Microsoft.NetApp Delegated Subnet has Subnet Read Permissions

Roles that are required for the management of Azure NetApp Files resources must have "Microsoft.network/virtualNetworks/subnets/read" permissions on the subnet that is delegated to Microsoft.NetApp. If the role, whether custom or built-in, doesn't have this permission, volume creation fails.

Potential benefits: Prevent volume creation failures by ensuring subnet/read permissions
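
A role definition can be checked for this permission by matching its action patterns, including wildcards, against the required operation. The sketch below is a simplified model of role evaluation, not the Azure authorization engine: it assumes a role is described by a list of allowed actions and an optional list of excluded actions, and `grants_subnet_read` is a hypothetical name.

```python
from fnmatch import fnmatchcase

REQUIRED = "Microsoft.Network/virtualNetworks/subnets/read"


def grants_subnet_read(actions, not_actions=()):
    """Return True if the actions allow the subnet read operation and no exclusion removes it."""
    def matches(patterns):
        # Operation names are compared case-insensitively; '*' is the only wildcard expected.
        return any(fnmatchcase(REQUIRED.lower(), p.lower()) for p in patterns)

    return matches(actions) and not matches(not_actions)


print(grants_subnet_read(["Microsoft.NetApp/*", "*/read"]))  # True
print(grants_subnet_read(["Microsoft.NetApp/*"]))            # False
```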

Implement disaster recovery strategies for your Azure NetApp Files resources

To avoid data or functionality loss during a regional or zonal disaster, implement common disaster recovery techniques such as cross region replication or cross zone replication for your Azure NetApp Files volumes.

Potential benefits: Manage disaster recovery easily with Azure NetApp Files replication features

For More information, see Understand data protection and disaster recovery options in Azure NetApp Files

Azure NetApp Files - Enable Continuous Availability for SMB Volumes

We recommend enabling Continuous Availability for Server Message Block (SMB) volumes in Azure NetApp Files.

Potential benefits: Prevent application disruptions by enabling Continuous Availability for SMB volumes

For More information, see Enable Continuous Availability on existing SMB volumes

Azure Site Recovery

Enable soft delete for your Recovery Services vaults

Soft delete helps you retain your backup data in the Recovery Services vault for an additional duration after deletion, giving you an opportunity to retrieve it before it's permanently deleted.

Potential benefits: Helps recovery of backup data in cases of accidental deletion

For More information, see Soft delete for Azure Backup

Enable Cross Region Restore for your Recovery Services vault

Cross Region Restore (CRR) allows you to restore Azure VMs in a secondary region (an Azure paired region), helping with disaster recovery.

Potential benefits: As one of the restore options, Cross Region Restore (CRR) allows you to restore Azure VMs in a secondary region, which is an Azure paired region.

For More information, see How to restore Azure VM data in Azure portal

Azure Spring Apps

Upgrade Application Configuration Service to Gen 2

You're still using Application Configuration Service Gen1, which will reach end of support by April 2024. Application Configuration Service Gen2 provides better performance than Gen1, and the upgrade from Gen1 to Gen2 requires zero downtime, so we recommend upgrading as soon as possible.

Potential benefits: Higher stability and availability

For More information, see Use Application Configuration Service for Tanzu

Azure SQL Database

Enable cross region disaster recovery for SQL Database

Enable cross region disaster recovery for Azure SQL Database for business continuity in the event of regional outage.

Potential benefits: Enabling disaster recovery creates a continuously synchronized readable secondary database for a primary database.

For More information, see Overview of business continuity with Azure SQL Database

Enable zone redundancy for Azure SQL Database to achieve high availability and resiliency

To achieve high availability and resiliency, enable zone redundancy for the SQL database or elastic pool to use availability zones and ensure the database or elastic pool is resilient to zonal failures.

Potential benefits: Enabling zone redundancy ensures Azure SQL Database is resilient to zonal hardware and software failures and the recovery is transparent to applications.

For More information, see Availability through redundancy - Azure SQL Database

Azure Stack HCI

Upgrade to the latest version of AKS enabled by Arc

Upgrade to the latest version of API/SDK of AKS enabled by Azure Arc for new functionality and improved stability.

Potential benefits: The latest version of AKS enabled by Azure Arc with new functionality and improved stability.

For More information, see https://azure.github.io/azure-sdk/releases/latest/index.html

Classic deployment model storage

Action required: Migrate classic storage accounts by 8/30/2024.

Migrate your classic storage accounts to Azure Resource Manager to ensure business continuity. Azure Resource Manager will provide all of the same functionality plus a consistent management layer, resource grouping, and access to new features and updates.

Potential benefits: Ensure the ability to manage your data by migrating your classic storage account(s)

Classic deployment model virtual machine

Migrate off Cloud Services (classic) before 31 August 2024

Cloud Services (classic) is retiring. To avoid any loss of data or business continuity, migrate off before 31 Aug 2024.

Potential benefits: Continuity of your service

For More information, see Migrate Azure Cloud Services (classic) to Azure Cloud Services (extended support)

Cognitive Services

Container Registry

Use Premium tier for critical production workloads

Premium registries provide the highest amount of included storage, concurrent operations and network bandwidth, enabling high-volume scenarios. The Premium tier also adds features such as geo-replication, availability zone support, content-trust, customer-managed keys and private endpoints.

Potential benefits: The Premium tier provides the highest amount of performance, scale and resiliency options

For More information, see Azure Container Registry service tiers

Ensure Geo-replication is enabled for resilience

Geo-replication enables workloads to use a single image, tag, and registry name across regions. It also provides network-close registry access, reduced data transfer costs, and regional registry resilience if a regional outage occurs. This feature is only available in the Premium service tier.

Potential benefits: Improved resilience and pull performance, simplified registry management and reduced data transfer costs

For More information, see Geo-replication in Azure Container Registry

Content Delivery Network

Azure CDN From Edgio, Managed Certificate Renewal Unsuccessful. Additional Validation Required.

Azure CDN from Edgio employs CNAME delegation to renew certificates with DigiCert for managed certificate renewals. It's essential that Custom Domains resolve to an azureedge.net endpoint for the automatic renewal process with DigiCert to be successful. Ensure your Custom Domain's CNAME and CAA records are configured correctly. Should you require further assistance, please submit a support case to Azure to re-attempt the renewal request.

Potential benefits: Ensure service availability.

Data Factory

Implement BCDR strategy for cross region redundancy in Azure Data Factory

Implementing a business continuity and disaster recovery (BCDR) strategy improves high availability and reduces the risk of data loss.

Potential benefits: Improves high availability and reduces the risk of data loss

For More information, see BCDR for Azure Data Factory and Azure Synapse Analytics pipelines - Azure Architecture Center

Enable auto upgrade on your SHIR

Auto-upgrade of the self-hosted integration runtime (SHIR) has been disabled, so you aren't getting the latest changes and bug fixes for the self-hosted integration runtime. Review the settings and enable the SHIR auto upgrade.

Potential benefits: To get the latest changes and bug fixes on the Self-Hosted Integration runtime

For More information, see Self-hosted integration runtime autoupdate and expire notification

Fluid Relay

Azure Fluid Relay client library should be upgraded

If the Azure Fluid Relay service is invoked with an old client library, it might cause application problems. To ensure your application remains operational, upgrade your Azure Fluid Relay client library to the latest version. Upgrading provides the most up-to-date functionality, and enhancements in performance and stability.

Potential benefits: Improved reliability

For More information, see Version compatibility with Fluid Framework releases

HDInsight

Apply critical updates by dropping and recreating your HDInsight clusters (certificate rotation round 2)

The HDInsight service attempted to apply a critical certificate update on your running clusters. However, due to some custom configuration changes, we're unable to apply the updates on all clusters. To prevent those clusters from becoming unhealthy and unusable, drop and recreate your clusters.

Potential benefits: Ensure cluster health and stability

For More information, see Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more

Non-ESP ABFS clusters [Cluster Permissions for World Readable]

We plan to introduce a change in non-ESP ABFS clusters that restricts non-Hadoop group users from running Hadoop commands for storage operations. This change improves the cluster security posture. Customers need to plan for the updates before September 30, 2023.

Potential benefits: This change is to improve cluster security posture

For More information, see Azure HDInsight release notes

Restart brokers on your Kafka Cluster Disks

When data disks used by Kafka brokers in HDInsight clusters are almost full, the Apache Kafka broker process can't start and fails. To mitigate, find the retention time for every topic, back up the files that are older, and restart the brokers.

Potential benefits: Avoid Kafka broker issues

For More information, see Scenario: Brokers are unhealthy or can't restart due to disk space full issue

Cluster Name length update

The max length of cluster name will be changed to 45 from 59 characters, to improve the security posture of clusters. This change will be implemented by September 30th, 2023.

Potential benefits: Security posture improvement for HDInsight

For More information, see Azure HDInsight release notes

Upgrade your cluster to the latest HDInsight image

A cluster created one year ago doesn't have the latest image upgrades. Your cluster was created 1 year ago. As part of the best practices, we recommend you use the latest HDInsight images for the best open source updates, Azure updates, and security fixes. The recommended maximum duration for cluster upgrades is less than six months.

Potential benefits: Get the latest fixes and features

For More information, see Consider the below points before starting to create a cluster.

Upgrade your HDInsight Cluster

A cluster not using the latest image doesn't have the latest upgrades. Your cluster isn't using the latest image. We recommend you use the latest versions of HDInsight images for the best of open source updates, Azure updates, and security fixes. HDInsight releases happen every 30 to 60 days.

Potential benefits: Get the latest fixes and features

For More information, see Azure HDInsight release notes

Gateway or virtual machine not reachable

We detected a network probe failure, which indicates an unreachable gateway or virtual machine. Verify the availability of all cluster hosts. Restart the virtual machine to recover. If you need further assistance, contact Azure support.

Potential benefits: Improved availability

VM agent is 9.9.9.9. Upgrade the cluster.

Our records indicate that one or more of your clusters are using images dated February 2022 or older (image versions 2202xxxxxx or older). There is a potential reliability issue on HDInsight clusters that use images dated February 2022 or older. Consider rebuilding your clusters with the latest image.

Potential benefits: Improved Reliability in Scaling and Network connectivity
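
If you track image versions for your clusters, a quick filter can flag the affected ones. This sketch assumes, based on the "2202xxxxxx" pattern above, that the first four digits of an image version encode the two-digit year and the month; the function name and the sample versions are hypothetical.

```python
def uses_affected_image(image_version: str, cutoff_yymm: int = 2202) -> bool:
    """Return True if the leading YYMM digits of the version are at or before the cutoff."""
    prefix = image_version[:4]
    if len(prefix) < 4 or not prefix.isdigit():
        raise ValueError(f"unexpected image version: {image_version!r}")
    return int(prefix) <= cutoff_yymm


# Hypothetical image versions for illustration only.
print(uses_affected_image("2201110029"))  # True
print(uses_affected_image("2307201242"))  # False
```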

Media Services

Increase Media Services quotas or limits

When a media account hits its quota limits, disruption of service might occur. To avoid any disruption of service, review current usage of assets, content key policies, and stream policies and increase quota limits for the entities that are close to hitting the limit. You can request quota limits be increased by opening a ticket and adding relevant details. TIP: Don't create additional Azure Media accounts in an attempt to obtain higher limits.

Potential benefits: Avoid any disruption to service due to customer exceeding quota limits.

For More information, see Azure Media Services quotas and limits

Service Bus

Use Service Bus premium tier for improved resilience

When running critical applications, the Service Bus premium tier offers better resource isolation at the CPU and memory level, enhancing availability. It also supports Geo-disaster recovery feature enabling easier recovery from regional disasters without having to change application configurations.

Potential benefits: Service Bus premium tier offers better resiliency with CPU and memory resource isolation as well as Geo-disaster recovery

For More information, see Service Bus premium messaging tier

Use Service Bus autoscaling feature in the premium tier for improved resilience

When running critical applications, enabling the auto scale feature allows you to have enough capacity to handle the load on your application. Having the right amount of resources running can reduce throttling and provide a better user experience.

Potential benefits: Enabling autoscale protects users from capacity constraints

For More information, see Automatically update messaging units of an Azure Service Bus namespace

SQL Server on Azure Virtual Machines

Enable Azure Backup for SQL on your virtual machines

For the benefits of zero-infrastructure backup, point-in-time restore, and central management with SQL AG integration, enable backups for SQL databases on your virtual machines using Azure Backup.

Potential benefits: SQL aware backups with no-infra for backup, centralized management, AG integration and point-in-time restore

For More information, see About SQL Server Backup in Azure VMs

Storage

Use Managed Disks for storage accounts reaching capacity limit

When Premium SSD unmanaged disks in storage accounts are about to reach their Premium Storage capacity limit, failures might occur. To avoid failures when this limit is reached, migrate to Managed Disks that don't have an account capacity limit. This migration can be done through the portal in less than 5 minutes.

Potential benefits: Avoid scale issues when account reaches capacity limit

For More information, see Scalability and performance targets for standard storage accounts

Configure blob backup

Azure blob backup helps protect data from accidental or malicious deletion. We recommend that you configure blob backup.

Potential benefits: Protect data from accidental or malicious deletion

For More information, see Overview of Azure Blob backup

Subscriptions

Turn on Azure Backup to get simple, reliable, and cost-effective protection for your data

Keep your information and applications safe with robust, one click backup from Azure. Activate Azure Backup to get cost-effective protection for a wide range of workloads including VMs, SQL databases, applications, and file shares.

Potential benefits: Ensure your business-critical applications stay protected

For More information, see Azure Backup Documentation - Azure Backup

Create an Azure Service Health alert

Azure Service Health alerts keep you informed about issues and advisories in four areas (Service issues, Planned maintenance, Security and Health advisories). These alerts are personalized to notify you about disruptions or potential impacts on your chosen Azure regions and services.

Potential benefits: Stay informed about issues and advisories across 4 areas (Service issues, Planned maintenance, Security advisories and Health advisories)

For More information, see Create activity log alerts on service notifications using the Azure portal

Virtual Machines

Improve data reliability by using Managed Disks

Virtual machines in an Availability Set with disks that share either storage accounts or storage scale units aren't resilient to single storage scale unit failures during outages. Migrate to Azure Managed Disks to ensure that the disks of different VMs in the Availability Set are sufficiently isolated to avoid a single point of failure.

Potential benefits: Ensure business continuity through data resilience

For More information, see https://aka.ms/aa_avset_manageddisk_learnmore

Enable virtual machine replication to protect your applications from regional outage

Virtual machines are resilient to regional outages when replication to another region is enabled. To reduce adverse business impact during an Azure region outage, we recommend enabling replication of all business-critical virtual machines.

Potential benefits: Ensure business continuity in case of any Azure region outage

For More information, see Quickstart: Set up disaster recovery to a secondary Azure region for an Azure VM

Update your outbound connectivity protocol to Service Tags for Azure Site Recovery

IP address-based allowlisting is a vulnerable way to control outbound connectivity for firewalls; Service Tags are a good alternative. We highly recommend using Service Tags to allow connectivity to Azure Site Recovery services for the machines.

Potential benefits: Ensures better security, stability and resiliency than hard coded IP Addresses

For More information, see About networking in Azure VM disaster recovery

Upgrade the standard disks attached to your premium-capable VM to premium disks

Using Standard SSD disks with premium VMs may lead to suboptimal performance and latency issues. We recommend that you consider upgrading the standard disks to premium disks. For any Single Instance Virtual Machine using premium storage for all Operating System Disks and Data Disks, we guarantee Virtual Machine Connectivity of at least 99.9%. When choosing to upgrade, there are two factors to consider. The first factor is that upgrading requires a VM reboot and that takes 3-5 minutes to complete. The second is if the VMs in the list are mission-critical production VMs, evaluate the improved availability against the cost of premium disks.

Potential benefits: Improved availability with single VM SLA available only when all disks are premium

For More information, see Azure managed disk types

Upgrade VM from Premium Unmanaged Disks to Managed Disks at no additional cost

Azure Managed Disks provide higher resiliency, simplified service management, higher scale target and more choices among several disk types. Your VM is using premium unmanaged disks that can be migrated to managed disks at no additional cost through the portal in less than 5 minutes.

Potential benefits: Leverage higher resiliency and other benefits of Managed Disks

For More information, see Introduction to Azure managed disks

Upgrade your deprecated Virtual Machine image to a newer image

Virtual Machines (VMs) in your subscription are running on images scheduled for deprecation. Once the image is deprecated, new VMs can't be created from the deprecated image. To prevent disruption to your workloads, upgrade to a newer image. (VMRunningDeprecatedImage)

Potential benefits: Minimize any potential disruptions to your VM workloads

For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines

Upgrade to a newer offer of Virtual Machine image

Virtual Machines (VMs) in your subscription are running on images scheduled for deprecation. Once the image is deprecated, new VMs can't be created from the deprecated image. To prevent disruption to your workloads, upgrade to a newer image. (VMRunningDeprecatedOfferLevelImage)

Potential benefits: Minimize any potential disruptions to your VM workloads

For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines

Upgrade to a newer SKU of Virtual Machine image

Virtual Machines (VMs) in your subscription are running on images scheduled for deprecation. Once the image is deprecated, new VMs can't be created from the deprecated image. To prevent disruption to your workloads, upgrade to a newer image.

Potential benefits: Minimize any potential disruptions to your VM workloads

For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines

Upgrade your Virtual Machine Scale Set to alternative image version

VMSS in your subscription are running on images that have been scheduled for deprecation. Once the image is deprecated, your Virtual Machine Scale Set workloads would no longer scale out. To prevent disruption to your workload, upgrade to a newer version of the image.

Potential benefits: Minimize any potential disruptions to your Virtual Machine Scale Set workloads

For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines

Upgrade your Virtual Machine Scale Set to alternative image offer

VMSS in your subscription are running on images that have been scheduled for deprecation. Once the image is deprecated, your Virtual Machine Scale Set workloads would no longer scale out. To prevent disruption to your workload, upgrade to a newer offer of the image.

Potential benefits: Minimize any potential disruptions to your Virtual Machine Scale Set workloads

For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines

Upgrade your Virtual Machine Scale Set to alternative image SKU

VMSS in your subscription are running on images that have been scheduled for deprecation. Once the image is deprecated, your Virtual Machine Scale Set workloads would no longer scale out. To prevent disruption to your workload, upgrade to a newer SKU of the image.

Potential benefits: Minimize any potential disruptions to your Virtual Machine Scale Set workloads

For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines

Provide access to mandatory URLs missing for your Azure Virtual Desktop environment

For a session host to deploy and register to Windows Virtual Desktop (WVD) properly, you need a set of URLs in the 'allowed list' in case your VM runs in a restricted environment. For specific URLs missing from your allowed list, search your application event log for event 3702.

Potential benefits: Ensure successful deployment and session host functionality when using Windows Virtual Desktop service

For More information, see Required FQDNs and endpoints for Azure Virtual Desktop

Align location of resource and resource group

To reduce the impact of region outages, co-locate your resources with their resource group in the same region. This way, Azure Resource Manager stores metadata related to all resources within the group in one region. By co-locating, you reduce the chance of being affected by region unavailability.

Potential benefits: Reduce write failures due to region outages

For More information, see What is Azure Resource Manager?
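As a rough illustration of the co-location check described above, the following Python sketch compares each resource's region with the region of its resource group. The function name, the dictionary keys, and the sample inventory are assumptions made for this example; they aren't an Azure SDK API.

```python
from typing import Iterable


def find_misaligned_resources(group_location: str, resources: Iterable[dict]) -> list[str]:
    """Return names of resources whose region differs from the resource group's region."""
    expected = group_location.strip().lower()
    return [
        r["name"]
        for r in resources
        if r.get("location", "").strip().lower() != expected
    ]


# Hypothetical inventory, for illustration only.
sample_resources = [
    {"name": "vm-app-01", "location": "eastus"},
    {"name": "sql-db-01", "location": "westus2"},
    {"name": "vnet-main", "location": "EastUS"},
]
misaligned = find_misaligned_resources("eastus", sample_resources)
print(misaligned)
```

The comparison is case-insensitive because region names can appear with different casing depending on where the inventory comes from.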

Use Availability zones for better resiliency and availability

Availability Zones (AZ) in Azure help protect your applications and data from datacenter failures. Each AZ is made up of one or more datacenters equipped with independent power, cooling, and networking. By designing solutions to use zonal VMs, you can isolate your VMs from failure in any other zone.

Potential benefits: Usage of zonal VMs protects your apps from zonal outage in any other zones.

Enable Azure Virtual Machine Scale Set (VMSS) application health monitoring

Configuring Virtual Machine Scale Set application health monitoring using the Application Health extension or load balancer health probes enables the Azure platform to improve the resiliency of your application by responding to changes in application health.

Potential benefits: Increase resiliency by exposing application health to Azure

For More information, see Using Application Health extension with Virtual Machine Scale Sets

Enable Backups on your Virtual Machines

Secure your data by enabling backups for your virtual machines.

Potential benefits: Protection of your Virtual Machines

For More information, see What is the Azure Backup service?

Enable automatic repair policy on Azure Virtual Machine Scale Sets (VMSS)

Enabling automatic instance repairs helps achieve high availability by maintaining a set of healthy instances. If an unhealthy instance is found by the Application Health extension or load balancer health probe, automatic instance repairs attempt to recover the instance by triggering repair actions.

Potential benefits: Increase resiliency by automating repair of failed instances

For More information, see Automatic instance repairs for Azure Virtual Machine Scale Sets

Configure Virtual Machine Scale Set automated scaling by metrics

Optimize resource utilization, reduce costs, and enhance application performance with custom autoscale based on a metric. Automatically add Virtual Machine instances based on real-time metrics such as CPU, memory, and disk operations. Ensure high availability while maintaining cost-efficiency.

Potential benefits: Ensures high availability while maintaining cost-efficiency

For More information, see Overview of autoscale with Azure Virtual Machine Scale Sets

Use Azure Disks with Zone Redundant Storage (ZRS) for higher resiliency and availability

Azure Disks with ZRS provide synchronous replication of data across three Availability Zones in a region, making the disk tolerant to zonal failures without disruptions to applications. For higher resiliency and availability, migrate disks from LRS to ZRS.

Potential benefits: By designing your applications to use ZRS Disks, your data is replicated across 3 Availability Zones, making your disk resilient to a zonal outage

For More information, see Convert a disk from LRS to ZRS

DNS Servers should be configured at the Virtual Network level

Set the DNS Servers for the VM at the Virtual Network level to ensure consistency throughout the environment. In the configuration of the primary network interface, DNS Servers setting should be set to Inherit from virtual network.

Potential benefits: Ensures consistency and reliable name resolution

For More information, see Name resolution for resources in Azure virtual networks

Migrate to Virtual Machine Scale Sets Flex

Migrate workloads from virtual machine (VM) to Virtual Machine Scale Sets Flex for deployment across zones or within the same zone across different fault domains. The platform plans to deprecate availability sets.

Potential benefits: Availability across zones or across different fault domains

For More information, see Migrate deployments and resources to Virtual Machine Scale Sets in Flexible orchestration

Workloads

Configure an Always On availability group for Multi-purpose SQL servers (MPSQL)

MPSQL servers with an Always On availability group have better availability. Your MPSQL servers aren't configured as part of an Always On availability group in the shared infrastructure in your Epic system. Always On availability groups improve database availability and resource use.

Potential benefits: Improved Database availability and resource use

For More information, see What is an Always On availability group?

Configure Local host cache on Citrix VDI servers to ensure seamless connection brokering operations

We have observed that your Citrix VDI servers aren't configured with Local Host Cache. Local Host Cache (LHC) is a feature in Citrix Virtual Apps and Desktops that allows connection brokering operations to continue when an outage occurs. LHC engages when the site database is inaccessible for 90 seconds.

Potential benefits: Seamless connection brokering operations

Deploy Hyperspace Web servers as part of a Virtual Machine Scale Set Flex configured for 3 zones

We have observed that your Hyperspace Web servers in the Virtual Machine Scale Set Flex setup aren't spread across 3 zones in the selected region. For services like Hyperspace Web in Epic systems that require high availability and large scale, we recommend that servers are deployed as part of Virtual Machine Scale Set Flex and spread across 3 zones. With Flexible orchestration, Azure provides a unified experience across the Azure VM ecosystem.

Potential benefits: High availability and on-demand large scale for Hyperspace web servers in Epic DB

For More information, see Create a Virtual Machine Scale Set that uses Availability Zones
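The zone-spread condition in this recommendation can be sketched in Python as follows. The list of zone labels is a hypothetical input that you would collect from your own scale set inventory; the helper names are assumptions for this example.

```python
from collections import Counter


def zone_spread(instance_zones: list[str]) -> Counter:
    """Count how many instances run in each availability zone."""
    return Counter(instance_zones)


def spans_three_zones(instance_zones: list[str]) -> bool:
    """Return True when the instances cover at least three distinct zones."""
    return len(set(instance_zones)) >= 3


# Hypothetical zone labels of the instances in one scale set.
zones = ["1", "2", "1", "3", "2"]
print(zone_spread(zones), spans_three_zones(zones))
```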

Set the Idle timeout in Azure Load Balancer to 30 minutes for ASCS HA setup in SAP workloads

To prevent load balancer timeout, make sure that all Azure Load Balancing Rules have: 'Idle timeout (minutes)' set to the maximum value of 30 minutes. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the setting.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Enable Floating IP in the Azure Load balancer for ASCS HA setup in SAP workloads

For port reuse and better high availability, enable floating IP in the load balancing rules for the Azure Load Balancer for HA setup of ASCS instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Enable HA ports in the Azure Load Balancer for ASCS HA setup in SAP workloads

For port reuse and better high availability, enable HA ports in the load balancing rules for HA setup of ASCS instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
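The three load balancer settings recommended for the ASCS HA setup (idle timeout of 30 minutes, floating IP, and HA ports) can be checked together. The following Python sketch assumes a rule is available as a dictionary whose keys follow the Azure Resource Manager property names (idleTimeoutInMinutes, enableFloatingIP, protocol, frontendPort, backendPort); the key names in your tooling's output may differ, and the function name is hypothetical.

```python
def check_sap_lb_rule(rule: dict) -> list[str]:
    """Return a list of findings for one load-balancing rule; empty means compliant."""
    findings = []
    if rule.get("idleTimeoutInMinutes") != 30:
        findings.append("idle timeout is not 30 minutes")
    if not rule.get("enableFloatingIP", False):
        findings.append("floating IP is disabled")
    # An HA ports rule uses protocol "All" with frontend and backend port 0.
    is_ha_ports = (
        str(rule.get("protocol", "")).lower() == "all"
        and rule.get("frontendPort") == 0
        and rule.get("backendPort") == 0
    )
    if not is_ha_ports:
        findings.append("HA ports are not enabled")
    return findings


# Hypothetical rule, for illustration only.
example_rule = {
    "idleTimeoutInMinutes": 4,
    "enableFloatingIP": True,
    "protocol": "Tcp",
    "frontendPort": 62000,
    "backendPort": 62000,
}
print(check_sap_lb_rule(example_rule))
```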

Disable TCP timestamps on VMs placed behind Azure Load Balancer in ASCS HA setup in SAP workloads

Disable TCP timestamps on VMs placed behind Azure Load Balancer. Enabling TCP timestamps causes the health probes to fail due to TCP packets being dropped by the VM's guest OS TCP stack, causing the load balancer to mark the endpoint as down.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see https://launchpad.support.sap.com/#/notes/2382421

Set the Idle timeout in Azure Load Balancer to 30 minutes for HANA DB HA setup in SAP workloads

To prevent load balancer timeout, ensure that all Azure Load Balancing Rules 'Idle timeout (minutes)' parameter is set to the maximum value of 30 minutes. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the recommended settings.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Enable Floating IP in the Azure Load balancer for HANA DB HA setup in SAP workloads

For more flexible routing, enable floating IP in the load balancing rules for the Azure Load Balancer for HA set up of HANA DB instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the recommended settings.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Enable HA ports in the Azure Load Balancer for HANA DB HA setup in SAP workloads

For enhanced scalability, enable HA ports in the Load balancing rules for HA set up of HANA DB instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the recommended settings.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Disable TCP timestamps on VMs placed behind Azure Load Balancer in HANA DB HA setup in SAP workloads

Disable TCP timestamps on VMs placed behind Azure Load Balancer. Enabling TCP timestamps causes the health probes to fail due to TCP packets dropped by the VM's guest OS TCP stack causing the load balancer to mark the endpoint as down.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see Azure Load Balancer health probes
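On a Linux guest, the kernel exposes the TCP timestamps setting as net.ipv4.tcp_timestamps, readable at /proc/sys/net/ipv4/tcp_timestamps, where 0 means disabled. The following Python sketch reads that value; the function names are assumptions for this example, and the script only reports the state without changing it.

```python
from pathlib import Path

TCP_TIMESTAMPS_PATH = Path("/proc/sys/net/ipv4/tcp_timestamps")


def is_disabled(raw_value: str) -> bool:
    """Interpret the raw sysctl value; '0' means TCP timestamps are off."""
    return raw_value.strip() == "0"


def tcp_timestamps_disabled(path: Path = TCP_TIMESTAMPS_PATH) -> bool:
    """Read the setting from the given path and report whether it is disabled."""
    return is_disabled(path.read_text())


# Only attempt the read where the file exists (that is, on a Linux host).
if __name__ == "__main__" and TCP_TIMESTAMPS_PATH.exists():
    print("TCP timestamps disabled:", tcp_timestamps_disabled())
```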

Ensure that stonith is enabled for the Pacemaker configuration in ASCS HA setup in SAP workloads

In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Set the corosync token in Pacemaker cluster to 30000 for ASCS HA setup in SAP workloads (RHEL)

The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to 30000 for SAP on Azure.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Set the expected votes parameter to '2' in Pacemaker configuration in ASCS HA setup in SAP workloads (RHEL)

For a two node HA cluster, set the quorum 'expected-votes' parameter to '2' as recommended for SAP on Azure to ensure a proper quorum, resilience, and data consistency.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Enable the 'concurrent-fencing' parameter in Pacemaker configuration in ASCS HA setup in SAP workloads (ConcurrentFencingHAASCSRH)

Concurrent fencing enables the fencing operations to be performed in parallel, which enhances high availability (HA), prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in the Pacemaker cluster configuration for ASCS HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Ensure that stonith is enabled for the cluster configuration in ASCS HA setup in SAP workloads

In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the stonith timeout to 144 for the cluster configuration in ASCS HA setup in SAP workloads

The ‘stonith-timeout’ specifies how long the cluster waits for a STONITH action to complete. Setting it to '144' seconds allows more time for fencing actions to complete. We recommend this setting for HA clusters for SAP on Azure.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the corosync token in Pacemaker cluster to 30000 for ASCS HA setup in SAP workloads (SUSE)

The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to '30000' for SAP on Azure.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set 'token_retransmits_before_loss_const' to 10 in Pacemaker cluster in ASCS HA setup in SAP workloads

The corosync token_retransmits_before_loss_const determines how many token retransmits are attempted before timeout in HA clusters. For stability and reliability, set the 'totem.token_retransmits_before_loss_const' to '10' for ASCS HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the 'corosync join' in Pacemaker cluster to '60' for ASCS HA setup in SAP workloads

The 'corosync join' timeout specifies in milliseconds how long to wait for join messages in the membership protocol so when a new node joins the cluster, it has time to synchronize its state with existing nodes. Set to '60' in Pacemaker cluster configuration for ASCS HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the 'corosync consensus' in Pacemaker cluster to '36000' for ASCS HA setup in SAP workloads

The corosync 'consensus' parameter specifies in milliseconds how long to wait for consensus before starting a round of membership in the cluster configuration. Set 'consensus' in the Pacemaker cluster configuration for ASCS HA setup to 1.2 times the corosync token for reliable failover behavior.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the 'corosync max_messages' in Pacemaker cluster to '20' for ASCS HA setup in SAP workloads

The corosync 'max_messages' constant specifies the maximum number of messages that one processor can send on receipt of the token. Set it to 20 in the Pacemaker cluster configuration to allow efficient communication without overwhelming the network.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set 'expected votes' to '2' in the cluster configuration in ASCS HA setup in SAP workloads (SUSE)

For a two node HA cluster, set the quorum 'expected_votes' parameter to 2 as recommended for SAP on Azure to ensure a proper quorum, resilience, and data consistency.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the two_node parameter to 1 in the cluster configuration in ASCS HA setup in SAP workloads

For a two node HA cluster, set the quorum parameter 'two_node' to 1 as recommended for SAP on Azure.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
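The corosync values recommended in this section for a two-node SUSE cluster (token, token_retransmits_before_loss_const, join, consensus, max_messages, expected_votes, and two_node) can be compared against a configuration in one pass. The following Python sketch uses a deliberately simplified parser that collects integer 'key: value' lines and ignores section nesting; the sample text is a hypothetical excerpt, not a complete corosync.conf.

```python
import re

# Values taken from the recommendations in this section (SUSE, two-node cluster).
RECOMMENDED = {
    "token": 30000,
    "token_retransmits_before_loss_const": 10,
    "join": 60,
    "consensus": 36000,
    "max_messages": 20,
    "expected_votes": 2,
    "two_node": 1,
}


def parse_corosync_settings(text: str) -> dict[str, int]:
    """Collect integer 'key: value' pairs, ignoring section nesting and comments."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z_]+)\s*:\s*(\d+)\s*$", line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def find_deviations(text: str) -> dict[str, tuple]:
    """Map each setting that is missing or different to (actual, recommended)."""
    actual = parse_corosync_settings(text)
    return {
        key: (actual.get(key), want)
        for key, want in RECOMMENDED.items()
        if actual.get(key) != want
    }


# Hypothetical excerpt in corosync.conf style, for illustration only.
sample = """
totem {
    version: 2
    token: 30000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 30000
    max_messages: 20
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
}
"""
deviations = find_deviations(sample)
print(deviations)
```

In this sample, the check reports 'consensus' as different from the recommended value and 'two_node' as missing. Because the parser flattens sections, it isn't suitable for keys that repeat across sections.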

Enable 'concurrent-fencing' in Pacemaker ASCS HA setup in SAP workloads (ConcurrentFencingHAASCSSLE)

Concurrent fencing enables the fencing operations to be performed in parallel, which enhances HA, prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in the Pacemaker cluster configuration for ASCS HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Ensure the number of 'fence_azure_arm' instances is one in Pacemaker in HA enabled SAP workloads

If you're using Azure fence agent for fencing with either managed identity or service principal, ensure that there's one instance of fence_azure_arm (an I/O fencing agent for Azure Resource Manager) in the Pacemaker configuration for ASCS HA setup for high availability.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set stonith-timeout to 900 in Pacemaker configuration with Azure fence agent for ASCS HA setup

For reliable function of the Pacemaker for ASCS HA setup, set the 'stonith-timeout' to 900. This setting is applicable if you're using the Azure fence agent for fencing with either managed identity or service principal.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
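The fencing-related cluster properties in this section can be checked together. The following Python sketch assumes the properties are already available as a dictionary of strings, however you export them from the cluster. It also assumes that the 144-second 'stonith-timeout' applies when the Azure fence agent isn't used and 900 applies when it is, and it uses Pacemaker's cluster option name 'stonith-enabled'. The function names are hypothetical.

```python
def normalize_timeout(value: str) -> str:
    """Strip an optional trailing 's' so '144s' and '144' compare equal."""
    return value.strip().removesuffix("s")


def check_cluster_properties(props: dict[str, str], uses_azure_fence_agent: bool) -> list[str]:
    """Return findings for fencing-related properties; empty means they match."""
    expected = {
        "stonith-enabled": "true",
        "concurrent-fencing": "true",
        # Assumption: 900 with the Azure fence agent, 144 otherwise.
        "stonith-timeout": "900" if uses_azure_fence_agent else "144",
    }
    findings = []
    for name, want in expected.items():
        found = props.get(name)
        comparable = found
        if found is not None and name == "stonith-timeout":
            comparable = normalize_timeout(found)
        if comparable != want:
            findings.append(f"{name}: expected {want}, found {found}")
    return findings


# Hypothetical property set, for illustration only.
example_props = {"stonith-enabled": "true", "stonith-timeout": "144"}
print(check_cluster_properties(example_props, uses_azure_fence_agent=True))
```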

Create the softdog config file in Pacemaker configuration for ASCS HA setup in SAP workloads

The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system has hung. Ensure that the softdog configuration file is created in the Pacemaker cluster for ASCS HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Ensure the softdog module is loaded for Pacemaker in ASCS HA setup in SAP workloads

The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system has hung. First ensure that you created the softdog configuration file, then load the softdog module in the Pacemaker configuration for ASCS HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
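One way to confirm that the softdog kernel module is loaded is to look for it in /proc/modules, where each line starts with a module name. The following Python sketch parses text in that format; the function name and the sample text are assumptions for this example.

```python
def is_module_loaded(proc_modules_text: str, module: str = "softdog") -> bool:
    """Return True if `module` is the first field of any line in /proc/modules text."""
    return any(
        line.split()[0] == module
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


# Hypothetical /proc/modules excerpt, for illustration only.
sample_modules = (
    "softdog 16384 2 - Live 0x0000000000000000\n"
    "nfs 311296 0 - Live 0x0000000000000000\n"
)
print(is_module_loaded(sample_modules))
```

On a Linux VM, you could pass the contents of /proc/modules to this function instead of the sample text.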

Set PREFER_SITE_TAKEOVER parameter to 'true' in the Pacemaker configuration for HANA DB HA setup

The PREFER_SITE_TAKEOVER parameter in SAP HANA defines if the HANA system replication (SR) resource agent prefers to take over the secondary instance instead of restarting the failed primary locally. For reliable function of HANA DB high availability (HA) setup, set PREFER_SITE_TAKEOVER to 'true'.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Enable stonith in the cluster configuration in HA enabled SAP workloads for VMs with Red Hat OS

In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration of your SAP workload.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Set the corosync token in Pacemaker cluster to 30000 for HA enabled HANA DB for VM with RHEL OS

The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to 30000 for SAP on Azure with Red Hat OS.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Set the expected votes parameter to '2' in HA enabled SAP workloads (RHEL)

For a two node HA cluster, set the quorum votes to '2' as recommended for SAP on Azure to ensure a proper quorum, resilience, and data consistency.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Enable the 'concurrent-fencing' parameter in the Pacemaker configuration for HANA DB HA setup

Concurrent fencing enables the fencing operations to be performed in parallel, which enhances high availability (HA), prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in the Pacemaker cluster configuration for HANA DB HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux

Set parameter PREFER_SITE_TAKEOVER to 'true' in the cluster configuration in HA enabled SAP workloads

The PREFER_SITE_TAKEOVER parameter in SAP HANA topology defines if the HANA SR resource agent prefers to take over the secondary instance instead of restarting the failed primary locally. For reliable function of HANA DB HA setup, set it to 'true'.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Enable stonith in the cluster configuration in HA enabled SAP workloads for VMs with SUSE OS

In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the stonith timeout to 144 for the cluster configuration in HA enabled SAP workloads

The ‘stonith-timeout’ specifies how long the cluster waits for a STONITH action to complete. Setting it to '144' seconds allows more time for fencing actions to complete. We recommend this setting for HA clusters for SAP on Azure.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the corosync token in Pacemaker cluster to 30000 for HA enabled HANA DB for VM with SUSE OS

The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to 30000 for HA enabled HANA DB for VM with SUSE OS.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set 'token_retransmits_before_loss_const' to 10 in Pacemaker cluster in HA enabled SAP workloads

The corosync token_retransmits_before_loss_const determines how many token retransmits are attempted before timeout in HA clusters. Set the totem.token_retransmits_before_loss_const to 10 as recommended for HANA DB HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the 'corosync join' in Pacemaker cluster to 60 for HA enabled HANA DB in SAP workloads

The 'corosync join' timeout specifies in milliseconds how long to wait for join messages in the membership protocol so when a new node joins the cluster, it has time to synchronize its state with existing nodes. Set to '60' in Pacemaker cluster configuration for HANA DB HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the 'corosync consensus' in Pacemaker cluster to 36000 for HA enabled HANA DB in SAP workloads

The corosync 'consensus' parameter specifies in milliseconds how long to wait for consensus before starting a new round of membership in the cluster. For reliable failover behavior, set 'consensus' in the Pacemaker cluster configuration for HANA DB HA setup to 1.2 times the corosync token.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
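The relationship between the two values in this recommendation is simple arithmetic: 1.2 times the recommended corosync token of 30000 gives 36000. A minimal Python sketch, with a hypothetical function name, is:

```python
def consensus_from_token(token_ms: int) -> int:
    """Return 1.2 times the token timeout, using integer math to avoid float rounding."""
    return token_ms * 12 // 10


consensus_ms = consensus_from_token(30000)
print(consensus_ms)
```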

Set the 'corosync max_messages' in Pacemaker cluster to 20 for HA enabled HANA DB in SAP workloads

The corosync 'max_messages' constant specifies the maximum number of messages that one processor can send on receipt of the token. To allow efficient communication without overwhelming the network, set it to 20 in the Pacemaker cluster configuration.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the expected votes parameter to 2 in HA enabled SAP workloads (SUSE)

Set the expected votes parameter to '2' in the cluster configuration in HA enabled SAP workloads to ensure a proper quorum, resilience, and data consistency.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set the two_node parameter to 1 in the cluster configuration in HA enabled SAP workloads

For a two node HA cluster, set the quorum parameter 'two_node' to 1 as recommended for SAP on Azure.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Enable the 'concurrent-fencing' parameter in the cluster configuration in HA enabled SAP workloads

Concurrent fencing enables the fencing operations to be performed in parallel, which enhances HA, prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in HA enabled SAP workloads.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Ensure there is one instance of fence_azure_arm in the Pacemaker configuration for HANA DB HA setup

If you're using Azure fence agent for fencing with either managed identity or service principal, ensure that one instance of fence_azure_arm (an I/O fencing agent for Azure Resource Manager) is in the Pacemaker configuration for HANA DB HA setup for high availability.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Set stonith-timeout to 900 in Pacemaker configuration with Azure fence agent for HANA DB HA setup

If you're using the Azure fence agent for fencing with either managed identity or service principal, ensure reliable function of the Pacemaker for HANA DB HA setup by setting the 'stonith-timeout' to 900.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Ensure that the softdog config file is in the Pacemaker configuration for HANA DB in SAP workloads

The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system is hung. Ensure that the softdog configuration file is created in the Pacemaker cluster for HANA DB HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Ensure the softdog module is loaded in Pacemaker in HANA DB HA setup in SAP workloads

The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system is hung. First ensure that you created the softdog configuration file, then load the softdog module in the Pacemaker configuration for HANA DB HA setup.

Potential benefits: Reliability of HA setup in SAP workloads

For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server

Next steps

Learn more about Reliability - Microsoft Azure Well Architected Framework