Monitor Azure Batch
This article describes:
- The types of monitoring data you can collect for this service.
- How to analyze that data.
Note
If you're already familiar with this service and/or Azure Monitor and just want to know how to analyze monitoring data, see the Analyze section near the end of this article.
When you have critical applications and business processes that rely on Azure resources, you need to monitor and get alerts for your system. The Azure Monitor service collects and aggregates metrics and logs from every component of your system. Azure Monitor provides you with a view of availability, performance, and resilience, and notifies you of issues. You can use the Azure portal, PowerShell, Azure CLI, REST API, or client libraries to set up and view monitoring data.
- For more information on Azure Monitor, see the Azure Monitor overview.
- For more information on how to monitor Azure resources in general, see Monitor Azure resources with Azure Monitor.
Resource types
Azure uses the concept of resource types and IDs to identify everything in a subscription. Azure Monitor similarly organizes core monitoring data into metrics and logs based on resource types, also called namespaces. Different metrics and logs are available for different resource types. Your service might be associated with more than one resource type.
Resource types are also part of the resource IDs for every resource running in Azure. For example, one resource type for a virtual machine is Microsoft.Compute/virtualMachines. For a list of services and their associated resource types, see Resource providers.
For more information about the resource types for Batch, see Batch monitoring data reference.
Data storage
For Azure Monitor:
- Metrics data is stored in the Azure Monitor metrics database.
- Log data is stored in the Azure Monitor logs store. Log Analytics is a tool in the Azure portal that can query this store.
- The Azure activity log is a separate store with its own interface in the Azure portal.
- You can optionally route metric and activity log data to the Azure Monitor logs store so you can query the data and correlate it with other log data by using Log Analytics.
For detailed information on how Azure Monitor stores data, see Azure Monitor data platform.
Access diagnostics logs in storage
If you archive Batch diagnostic logs in a storage account, a storage container is created in the storage account as soon as a related event occurs. Blobs are created according to the following naming pattern:
insights-{log category name}/resourceId=/SUBSCRIPTIONS/{subscription ID}/
RESOURCEGROUPS/{resource group name}/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/{Batch account name}/y={four-digit numeric year}/
m={two-digit numeric month}/d={two-digit numeric day}/
h={two-digit 24-hour clock hour}/m=00/PT1H.json
For example:
insights-metrics-pt1m/resourceId=/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/
RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/MYBATCHACCOUNT/y=2018/m=03/d=05/h=22/m=00/PT1H.json
Each PT1H.json blob file contains JSON-formatted events that occurred within the hour specified in the blob URL (for example, h=12). During the present hour, events are appended to the PT1H.json file as they occur. The minute value (m=00) is always 00, since diagnostic log events are broken into individual blobs per hour. All times are in UTC.
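The naming pattern above can be sketched as a small helper. This is a minimal illustration, not an official API: the function name and parameters are hypothetical, and upper-casing the identifiers is an assumption made to mirror the example path shown earlier.

```python
from datetime import datetime, timezone


def diagnostic_blob_path(category, subscription_id, resource_group, account, hour_utc):
    """Build the container-plus-blob path for one hour of archived Batch data."""
    # Assumption: identifiers appear upper-cased in the path, as in the example above.
    resource_id = (
        f"/SUBSCRIPTIONS/{subscription_id.upper()}"
        f"/RESOURCEGROUPS/{resource_group.upper()}"
        "/PROVIDERS/MICROSOFT.BATCH"
        f"/BATCHACCOUNTS/{account.upper()}"
    )
    # The minute segment is always 00 because there is one blob per hour.
    return (
        f"insights-{category}/resourceId={resource_id}"
        f"/y={hour_utc:%Y}/m={hour_utc:%m}/d={hour_utc:%d}"
        f"/h={hour_utc:%H}/m=00/PT1H.json"
    )


example_path = diagnostic_blob_path(
    "metrics-pt1m",
    "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "MyResourceGroup",
    "MyBatchAccount",
    datetime(2018, 3, 5, 22, tzinfo=timezone.utc),  # all times are UTC
)
print(example_path)
```

Passing the hour as a timezone-aware UTC datetime keeps the caller from accidentally building a path from local time.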
The following example shows a PoolResizeCompleteEvent entry in a PT1H.json log file. The entry includes information about the current and target number of dedicated and low-priority nodes and the start and end time of the operation.
{
    "Tenant": "65298bc2729a4c93b11c00ad7e660501",
    "time": "2019-08-22T20:59:13.5698778Z",
    "resourceId": "/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/BATCHACCOUNTS/MYBATCHACCOUNT/",
    "category": "ServiceLog",
    "operationName": "PoolResizeCompleteEvent",
    "operationVersion": "2017-06-01",
    "properties": {
        "id": "MYPOOLID",
        "nodeDeallocationOption": "Requeue",
        "currentDedicatedNodes": 10,
        "targetDedicatedNodes": 100,
        "currentLowPriorityNodes": 0,
        "targetLowPriorityNodes": 0,
        "enableAutoScale": false,
        "isAutoPool": false,
        "startTime": "2019-08-22 20:50:59.522",
        "endTime": "2019-08-22 20:59:12.489",
        "resultCode": "Success",
        "resultMessage": "The operation succeeded"
    }
}
To access the logs in your storage account programmatically, use the Storage APIs.
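After a PT1H.json blob is downloaded, its events can be processed with the Python standard library alone. The sketch below assumes that each line of the file holds one JSON event (an assumption about the blob layout, so verify it against your own data) and uses a trimmed copy of the PoolResizeCompleteEvent entry shown earlier. The function name is hypothetical.

```python
import json
from datetime import datetime

# Trimmed copy of the PoolResizeCompleteEvent entry from this article.
SAMPLE_LINE = (
    '{"category": "ServiceLog", "operationName": "PoolResizeCompleteEvent", '
    '"properties": {"id": "MYPOOLID", "startTime": "2019-08-22 20:50:59.522", '
    '"endTime": "2019-08-22 20:59:12.489", "resultCode": "Success"}}'
)

# Format of startTime and endTime as they appear in the sample event.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def resize_durations(lines):
    """Yield (pool ID, result code, elapsed seconds) for each pool resize event."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("operationName") != "PoolResizeCompleteEvent":
            continue  # ignore other event types
        props = event["properties"]
        start = datetime.strptime(props["startTime"], TIME_FORMAT)
        end = datetime.strptime(props["endTime"], TIME_FORMAT)
        yield props["id"], props["resultCode"], (end - start).total_seconds()


results = list(resize_durations([SAMPLE_LINE]))
print(results)
```

In practice, the `lines` argument would come from the downloaded blob content, for example by iterating over an open file.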
Azure Monitor platform metrics
Azure Monitor provides platform metrics for most services. These metrics are:
- Individually defined for each namespace.
- Stored in the Azure Monitor time-series metrics database.
- Lightweight and capable of supporting near real-time alerting.
- Used to track the performance of a resource over time.
Collection: Azure Monitor collects platform metrics automatically. No configuration is required.
Routing: You can also usually route platform metrics to Azure Monitor logs / Log Analytics so you can query them with other log data. For more information, see the Metrics diagnostic setting. For how to configure diagnostic settings for a service, see Create diagnostic settings in Azure Monitor.
For a list of all metrics it's possible to gather for all resources in Azure Monitor, see Supported metrics in Azure Monitor.
Examples of metrics in a Batch account are Pool Create Events, Low-Priority Node Count, and Task Complete Events. These metrics can help identify trends and can be used for data analysis.
Note
Metrics emitted in the last 3 minutes might still be aggregating, so values might be underreported during this time frame. Metric delivery isn't guaranteed and might be affected by out-of-order delivery, data loss, or duplication.
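One way to account for the aggregation delay described in the note is to end each metrics query a few minutes before the current time. The following is a minimal sketch of that idea. The helper name is hypothetical, and the three-minute margin is taken directly from the note.

```python
from datetime import datetime, timedelta, timezone

AGGREGATION_LAG = timedelta(minutes=3)  # margin taken from the note above


def settled_window(now, lookback):
    """Return (start, end) for a query that skips the still-aggregating tail."""
    end = now - AGGREGATION_LAG
    return end - lookback, end


start, end = settled_window(
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),  # stand-in for "now"
    timedelta(hours=1),
)
print(start.isoformat(), end.isoformat())
```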
For a complete list of available metrics for Batch, see Batch monitoring data reference.
Azure Monitor resource logs
Resource logs provide insight into operations that were done by an Azure resource. Logs are generated automatically, but you must route them to Azure Monitor logs to save or query them. Logs are organized by category. A given namespace might have multiple resource log categories.
Collection: Resource logs aren't collected and stored until you create a diagnostic setting and route the logs to one or more locations. When you create a diagnostic setting, you specify which categories of logs to collect. There are multiple ways to create and maintain diagnostic settings, including the Azure portal, programmatic methods, and Azure Policy.
Routing: The suggested default is to route resource logs to Azure Monitor Logs so you can query them with other log data. Other locations such as Azure Storage, Azure Event Hubs, and certain Azure monitoring partners are also available. For more information, see Azure resource logs and Resource log destinations.
For detailed information about collecting, storing, and routing resource logs, see Diagnostic settings in Azure Monitor.
For a list of all available resource log categories in Azure Monitor, see Supported resource logs in Azure Monitor.
All resource logs in Azure Monitor have the same header fields, followed by service-specific fields. The common schema is outlined in Azure Monitor resource log schema.
For the available resource log categories, their associated Log Analytics tables, and the logs schemas for Batch, see Batch monitoring data reference.
You must explicitly enable diagnostic settings for each Batch account you want to monitor.
For the Batch service, you can collect the following logs:
- ServiceLog: Events emitted by the Batch service during the lifetime of an individual resource such as a pool or task.
- AllMetrics: Metrics at the Batch account level.
The following screenshot shows an example diagnostic setting that sends allLogs and AllMetrics to a Log Analytics workspace.
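A comparable setting can also be expressed as a request to the Azure Monitor diagnostic settings REST API. The sketch below only builds the URL and request body; it doesn't send anything. The API version, the field names, and the helper name reflect one reading of that REST API and should be checked against the current reference. The resource IDs are placeholders.

```python
import json


def diagnostic_setting_request(batch_account_id, setting_name, workspace_id):
    """Build the PUT URL and body for a diagnostic setting on a Batch account."""
    url = (
        f"https://management.azure.com{batch_account_id}"
        f"/providers/Microsoft.Insights/diagnosticSettings/{setting_name}"
        "?api-version=2021-05-01-preview"  # assumed API version
    )
    body = {
        "properties": {
            "workspaceId": workspace_id,
            # Send every resource log category, plus account-level metrics.
            "logs": [{"categoryGroup": "allLogs", "enabled": True}],
            "metrics": [{"category": "AllMetrics", "enabled": True}],
        }
    }
    return url, body


# Placeholder IDs for illustration only.
url, body = diagnostic_setting_request(
    "/subscriptions/SUB_ID/resourceGroups/MyResourceGroup"
    "/providers/Microsoft.Batch/batchAccounts/mybatchaccount",
    "send-to-workspace",
    "/subscriptions/SUB_ID/resourceGroups/MyResourceGroup"
    "/providers/Microsoft.OperationalInsights/workspaces/myworkspace",
)
print(url)
print(json.dumps(body, indent=2))
```

To collect only Batch service events, the `logs` entry could name the ServiceLog category instead of the allLogs group.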
When you create an Azure Batch pool, you can install any of the following monitoring-related extensions on the compute nodes to collect and analyze data:
- Azure Monitor agent for Linux
- Azure Monitor agent for Windows
- Azure Diagnostics extension for Windows VMs
- Azure Monitor Logs analytics and monitoring extension for Linux
- Azure Monitor Logs analytics and monitoring extension for Windows
For a comparison of the different extensions and agents and the data they collect, see Compare agents.
Azure activity log
The activity log contains subscription-level events that track operations for each Azure resource as seen from outside that resource; for example, creating a new resource or starting a virtual machine.
Collection: Activity log events are automatically generated and collected in a separate store for viewing in the Azure portal.
Routing: You can send activity log data to Azure Monitor Logs so you can analyze it alongside other log data. Other locations such as Azure Storage, Azure Event Hubs, and certain Azure monitoring partners are also available. For more information on how to route the activity log, see Overview of the Azure activity log.
For Batch accounts specifically, the activity log collects events related to account creation and deletion and key management.
Analyze monitoring data
There are many tools for analyzing monitoring data.
Azure Monitor tools
Azure Monitor supports the following basic tools:
- Metrics explorer, a tool in the Azure portal that allows you to view and analyze metrics for Azure resources. For more information, see Analyze metrics with Azure Monitor metrics explorer.
- Log Analytics, a tool in the Azure portal that allows you to query and analyze log data by using the Kusto query language (KQL). For more information, see Get started with log queries in Azure Monitor.
- The activity log, which has a user interface in the Azure portal for viewing and basic searches. To do more in-depth analysis, you have to route the data to Azure Monitor logs and run more complex queries in Log Analytics.
Tools that allow more complex visualization include:
- Dashboards that let you combine different kinds of data into a single pane in the Azure portal.
- Workbooks, customizable reports that you can create in the Azure portal. Workbooks can include text, metrics, and log queries.
- Power BI, a business analytics service that provides interactive visualizations across various data sources. You can configure Power BI to automatically import log data from Azure Monitor to take advantage of these visualizations.
When you analyze count-based Batch metrics like Dedicated Core Count, use the Avg aggregation. For event-based metrics like Pool Resize Complete Events, use the Count aggregation. Avoid using the Sum aggregation, which adds up the values of all data points received over the period of the chart.
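A short worked example shows why Sum is misleading for a count-based metric. The sample values are hypothetical: five one-minute data points for Dedicated Core Count on a pool that holds steady at 16 cores, and three data points for an event-based metric.

```python
from statistics import mean

# Hypothetical one-minute samples of Dedicated Core Count over five minutes.
core_count_samples = [16, 16, 16, 16, 16]

avg_cores = mean(core_count_samples)  # reflects the actual pool size
sum_cores = sum(core_count_samples)   # grows with the number of samples

# Hypothetical data points for an event-based metric, one per event.
resize_event_points = [1, 1, 1]
event_count = len(resize_event_points)  # Count aggregation: number of points

print(avg_cores, sum_cores, event_count)
```

The average stays at the real core count no matter how long the chart period is, while the sum scales with the number of data points in the period.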
Azure Monitor export tools
You can get data out of Azure Monitor into other tools by using the following methods:
- Metrics: Use the REST API for metrics to extract metric data from the Azure Monitor metrics database. The API supports filter expressions to refine the data retrieved. For more information, see Azure Monitor REST API reference.
- Logs: Use the REST API or the associated client libraries.
To get started with the REST API for Azure Monitor, see Azure monitoring REST API walkthrough.
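As a rough illustration of a metrics request, the sketch below assembles a URL for the Azure Monitor metrics REST API without sending it. The helper name is hypothetical, the API version is an assumption, and the metric name is a placeholder. Look up the exact REST metric names in the Batch monitoring data reference, and add an authorization header before you call the endpoint.

```python
from urllib.parse import urlencode


def metrics_request_url(resource_id, metric_names, timespan,
                        interval="PT1M", aggregation="Average",
                        api_version="2018-01-01"):
    """Build a GET URL for the metrics endpoint of a single resource."""
    query = urlencode({
        "api-version": api_version,  # assumed version; newer ones may exist
        "metricnames": ",".join(metric_names),
        "timespan": timespan,        # ISO 8601 start/end pair
        "interval": interval,        # ISO 8601 duration
        "aggregation": aggregation,
    })
    return (
        f"https://management.azure.com{resource_id}"
        f"/providers/Microsoft.Insights/metrics?{query}"
    )


metrics_url = metrics_request_url(
    "/subscriptions/SUB_ID/resourceGroups/MyResourceGroup"
    "/providers/Microsoft.Batch/batchAccounts/mybatchaccount",  # placeholder ID
    ["TaskFailEvent"],  # placeholder metric name
    "2024-01-01T00:00:00Z/2024-01-01T01:00:00Z",
    aggregation="Count",
)
print(metrics_url)
```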
Kusto queries
You can analyze monitoring data in the Azure Monitor Logs / Log Analytics store by using the Kusto query language (KQL).
Important
When you select Logs from the service's menu in the portal, Log Analytics opens with the query scope set to the current service. This scope means that log queries will only include data from that type of resource. If you want to run a query that includes data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.
For a list of common queries for any service, see the Log Analytics queries interface.
Sample queries
Here are a few sample log queries for Batch:
Pool resizes: Lists resize times by pool and result code (success or failure):
AzureDiagnostics
| where OperationName=="PoolResizeCompleteEvent"
| summarize operationTimes=make_list(startTime_s) by poolName=id_s, resultCode=resultCode_s
Task durations: Gives the elapsed time of tasks in seconds, from task start to task complete.
AzureDiagnostics
| where OperationName=="TaskCompleteEvent"
| extend taskId=id_s, ElapsedTime=datetime_diff('second', executionInfo_endTime_t, executionInfo_startTime_t) // For longer running tasks, consider changing 'second' to 'minute' or 'hour'
| summarize taskList=make_list(taskId) by ElapsedTime
Failed tasks per job: Lists failed tasks by parent job.
AzureDiagnostics
| where OperationName=="TaskFailEvent"
| summarize failedTaskList=make_list(id_s) by jobId=jobId_s, ResourceId
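If the logs are archived to a storage account instead of a workspace, a similar grouping can be approximated offline. The sketch below mirrors the failed-tasks query in Python over events that are already parsed into dictionaries. The `id` and `jobId` property names are inferred from the `id_s` and `jobId_s` columns in the query, which is an assumption about the raw event shape. The sample events are invented for illustration.

```python
from collections import defaultdict


def failed_tasks_by_job(events):
    """Group the IDs of failed tasks by their parent job ID."""
    grouped = defaultdict(list)
    for event in events:
        if event.get("operationName") != "TaskFailEvent":
            continue  # keep only task failure events
        props = event.get("properties", {})
        grouped[props.get("jobId")].append(props.get("id"))
    return dict(grouped)


# Invented sample events for illustration only.
sample_events = [
    {"operationName": "TaskFailEvent", "properties": {"jobId": "job-1", "id": "task-3"}},
    {"operationName": "TaskCompleteEvent", "properties": {"jobId": "job-1", "id": "task-4"}},
    {"operationName": "TaskFailEvent", "properties": {"jobId": "job-2", "id": "task-1"}},
    {"operationName": "TaskFailEvent", "properties": {"jobId": "job-1", "id": "task-7"}},
]

failed = failed_tasks_by_job(sample_events)
print(failed)
```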
Alerts
Azure Monitor alerts proactively notify you when specific conditions are found in your monitoring data. Alerts allow you to identify and address issues in your system before your customers notice them. For more information, see Azure Monitor alerts.
There are many sources of common alerts for Azure resources. For examples of common alerts for Azure resources, see Sample log alert queries. The Azure Monitor Baseline Alerts (AMBA) site provides key alert metrics, dashboards, and guidelines for Azure Landing Zone (ALZ) scenarios.
The common alert schema standardizes the consumption of Azure Monitor alert notifications. For more information, see Common alert schema.
Types of alerts
You can alert on any metric or log data source in the Azure Monitor data platform. There are many different types of alerts depending on the services you're monitoring and the monitoring data you're collecting. Different types of alerts have various benefits and drawbacks. For more information, see Choose the right monitoring alert type.
The following list describes the types of Azure Monitor alerts you can create:
- Metric alerts evaluate resource metrics at regular intervals. Metrics can be platform metrics, custom metrics, logs from Azure Monitor converted to metrics, or Application Insights metrics. Metric alerts can also apply multiple conditions and dynamic thresholds.
- Log alerts allow users to use a Log Analytics query to evaluate resource logs at a predefined frequency.
- Activity log alerts trigger when a new activity log event occurs that matches defined conditions. Resource Health alerts and Service Health alerts are activity log alerts that report on your service and resource health.
You can also create the following types of alerts for some Azure services:
- Smart detection alerts on an Application Insights resource automatically warn you of potential performance problems and failure anomalies in your web application. You can migrate smart detection on your Application Insights resource to create alert rules for the different smart detection modules.
- Prometheus alerts alert on Prometheus metrics stored in Azure Monitor managed service for Prometheus. The alert rules are based on the PromQL open-source query language. Your service might not support this type of alert. Currently, Prometheus is used on a limited set of services with a guest operating system, such as Azure Virtual Machines and Azure Container Instances.
- Recommended alert rules are available out-of-box for some Azure resources, including virtual machines, Azure Kubernetes Service (AKS) resources, and Log Analytics workspaces.
Monitor multiple resources
You can monitor at scale by applying the same metric alert rule to multiple resources of the same type that exist in the same Azure region. Individual notifications are sent for each monitored resource. For supported Azure services and clouds, see Monitor multiple resources with one alert rule.
Note
If you're creating or running an application that runs on your service, Azure Monitor Application Insights might offer more types of alerts.
Batch alert rules
Because metric delivery can be subject to inconsistencies such as out-of-order delivery, data loss, or duplication, you should avoid alerts that trigger on a single data point. Instead, use thresholds to account for these inconsistencies over a period of time.
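The idea of alerting on a sustained condition rather than a single data point can be sketched as follows. This isn't how Azure Monitor evaluates alert rules internally. It's a hypothetical helper that shows the logic of requiring several violations within an evaluation window.

```python
def should_alert(values, threshold, min_violations):
    """Return True when at least min_violations points exceed the threshold."""
    violations = sum(1 for value in values if value > threshold)
    return violations >= min_violations


# Hypothetical Unusable Node Count samples over a five-minute window.
single_spike = [0, 0, 3, 0, 0]  # could be a duplicate or out-of-order point
sustained = [2, 3, 3, 0, 4]     # repeated violations across the window

spike_fires = should_alert(single_spike, threshold=0, min_violations=3)
sustained_fires = should_alert(sustained, threshold=0, min_violations=3)
print(spike_fires, sustained_fires)
```

Requiring three of five points to breach the threshold means one stray data point doesn't trigger a notification, while a persistent problem still does.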
The following table lists some alert rule triggers for Batch. These alert rules are just examples. You can set alerts for any metric, log entry, or activity log entry listed in the Batch monitoring data reference.
Alert type | Condition | Description |
---|---|---|
Metric | Unusable Node Count | Whenever the Unusable Node Count is greater than 0 |
Metric | Task Fail Events | Whenever the total Task Fail Events is greater than the dynamic threshold |
Advisor recommendations
If critical conditions or imminent changes occur during resource operations, an alert displays on the Overview page in the portal.
You can find more information and recommended fixes for the alert in Advisor recommendations under Monitoring. During normal operations, no advisor recommendations display.
For more information on Azure Advisor, see Azure Advisor overview.
Other Batch monitoring options
Batch Explorer is a free, rich-featured, standalone client tool to help create, debug, and monitor Azure Batch applications. You can use Azure Batch Insights with Batch Explorer to get system statistics for your Batch nodes, such as virtual machine (VM) performance counters.
In your Batch applications, you can use the Batch .NET library to monitor or query the status of your resources including jobs, tasks, nodes, and pools. For example:
- Monitor the task state.
- Monitor the node state.
- Monitor the pool state.
- Monitor pool usage in the account.
- Count pool nodes by state.
You can use the Batch APIs to create list queries for Batch jobs, tasks, compute nodes, and other resources. For more information about how to filter list queries, see Create queries to list Batch resources efficiently.
Or, instead of potentially time-consuming list queries that return detailed information about large collections of tasks or nodes, you can use the Get Task Counts and List Pool Node Counts operations to get counts for Batch tasks and compute nodes. For more information, see Monitor Batch solutions by counting tasks and nodes by state.
You can integrate Application Insights with your Azure Batch applications to instrument your code with custom metrics and tracing. For a detailed walkthrough of how to add Application Insights to a Batch .NET solution, instrument application code, monitor the application in the Azure portal, and build custom dashboards, see Monitor and debug an Azure Batch .NET application with Application Insights and the accompanying code sample.
Related content
- See Batch monitoring data reference for a reference of the metrics, logs, and other important values created for Batch.
- See Monitoring Azure resources with Azure Monitor for general details on monitoring Azure resources.
- Learn about the Batch APIs and tools available for building Batch solutions.