Automating Azure Instrumentation and Monitoring - Part 1: Introduction
Instrumentation and monitoring is a critical part of managing any application or system. By proactively monitoring the health of the system as a whole, as well as each of its components, we can mitigate potential issues before they affect customers. And if issues do occur, good instrumentation alerts us to that fact so that we can respond quickly.
Azure provides a set of powerful monitoring and instrumentation tools to instrument almost all Azure services as well as our own applications. By taking advantage of these tools we can improve the quality of our systems. However, there isn't a lot of documentation on how to script and automate the instrumentation components that we build. Alerts, dashboards, and other instrumentation components are important parts of our systems and deserve as much attention as our application code or other parts of our infrastructure. In this series, we'll cover many of the common types of instrumentation used in Azure-hosted systems and will outline how many of these can be automated, usually with a combination of ARM templates and scripting. The series consists of nine parts:
Part 1 (this post) provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an 'infrastructure as code' mindset for our instrumentation and monitoring components.
Part 2 describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.
Part 3 discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.
Part 4 covers the basics of alerts and metric alerts. Azure Monitor's powerful alerting system is a big topic, and in this part we'll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.
Part 5 covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts notify us when Azure itself is having an issue that may result in downtime or degraded performance.
Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I'll show a few tips and tricks I've learned when doing this.
Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We'll discuss deploying and automating both single-step (ping) and multi-step availability tests.
Part 8 (coming soon) describes autoscale. While this isn't exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.
Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.
While the posts will cover the basics of each of these topics, the focus will be on deploying and automating each of these components. I'll provide links to more details on the inner workings where needed to supplement the basic overview I'll provide. Also, I'll assume some basic familiarity with ARM templates and PowerShell.
Let's start by reviewing the landscape of instrumentation on Azure.
Azure's Instrumentation Platform
As Azure has evolved, it's built up an increasingly comprehensive suite of tools for monitoring the individual components of a system as well as complete systems as a whole. The key piece of the Azure monitoring puzzle is named, appropriately enough, Azure Monitor. Azure Monitor is a built-in service that works with almost all Azure services. Many of its features are free. It automatically captures telemetry, consolidates it, and makes the data available for interactive querying as well as for a variety of other purposes that we'll discuss throughout the series.
This isn't quite the whole story, though. While Azure Monitor works well most of the time, and it appears to be the strategic direction that Azure is heading in, there are a number of exceptions, caveats, and complexities - and these become more evident when you try to automate it. I'll cover some of these in more detail below.
Metrics
Metrics are numeric values that represent a distinct piece of information about a component at a point in time. The exact list of metrics depends on what makes sense for a given service. For example, a virtual machine publishes metrics for the CPU and memory used; a SQL database has metrics for the number of connections and the database throughput units used; a Cosmos DB account publishes metrics for the number of requests issued to the database engine; and an App Service has metrics for the number of requests flowing through. There can be dozens of different metrics published for any given Azure service, and they are all documented for reference. We'll discuss metrics in more detail throughout the series, as there are some important things to be aware of when dealing with metrics.
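To make the idea of a metric a little more concrete, here is a minimal Python sketch of a metric sample and of averaging samples over a time window, which is a common thing to do with values like these. The class, the field names, the resource name `vm-web-01`, and the metric name `cpu_percent` are all hypothetical and chosen for illustration; this is not how Azure Monitor represents metrics internally.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import mean


@dataclass(frozen=True)
class MetricSample:
    """One numeric observation about a component at a point in time."""
    resource: str        # hypothetical resource name
    name: str            # illustrative metric name, not an Azure identifier
    timestamp: datetime
    value: float
    unit: str


def average_over_window(samples, name, end, window):
    """Average one metric's values in the window (end - window, end]."""
    start = end - window
    values = [s.value for s in samples
              if s.name == name and start < s.timestamp <= end]
    return mean(values) if values else None


now = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)

# Four CPU readings: three in the last five minutes, one ten minutes old.
samples = [
    MetricSample("vm-web-01", "cpu_percent", now - timedelta(minutes=m), v, "Percent")
    for m, v in [(0, 80.0), (1, 60.0), (2, 70.0), (10, 5.0)]
]

cpu_average = average_over_window(samples, "cpu_percent", now, timedelta(minutes=5))
print(cpu_average)
```

The ten-minute-old reading falls outside the five-minute window, so only the three recent samples contribute to the average.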
As well as Azure Monitor's metrics support, some Azure services have their own metrics systems. For example, Azure SQL has a large amount of telemetry that can be accessed through dynamic management views. Some of the key metrics are also published into Azure Monitor, but if you want to use metrics that are only available in dynamic management views then you won't be able to use the analysis and processing features of Azure Monitor. We'll discuss a potential workaround for this in part 3 of this series.
A similar example is Azure Storage queues. Azure Storage has an API that can be used to retrieve the approximate number of messages sitting in a queue, but this metric isn't published into Azure Monitor and so isn't available for alerting or dashboarding. Again, we'll discuss a potential workaround for this in part 3 of this series.
Nevertheless, in my experience, almost all of the metrics I work with on a regular basis are published through Azure Monitor, and so in this series we'll predominantly focus on these.
Logs
Logs are structured pieces of data, usually with a category, a level, and a textual message, and often with a lot of additional contextual data as well. Broadly speaking, there are several general types of logs that Azure deals with:
Resource activity logs are essentially the logs for management operations performed on Azure resources through the Azure Resource Management (ARM) API, and a few other types of management-related logs. They can be interactively queried using the Azure Portal blades for any resource, as well as resource groups and subscriptions. You can typically view these by looking at the Activity log tab from any Azure resource blade in the portal. Activity logs contain all write operations that pass through the ARM API. If you use the ARM API directly, or indirectly through the Azure Portal, CLI, PowerShell, or anything else, you'll see logs appear in here. More details on activity logs are available here.
Azure AD activity logs track Active Directory sign-ins and management actions. These can be viewed from within the Azure AD portal blade. We won't be covering Azure AD much in this series, but you can read more detail about Azure AD logs here.
Diagnostic logs are published by individual Azure services. They provide information about the actions and work that the service itself is doing. By default these are not usually available for interactive querying. Diagnostic logs often work quite differently between different services. For example, Azure Storage can publish its own internal logs into a $logs blob container; App Services provides web server and application logs and can save these to a number of different places as well as view them in real time; and Azure SQL logs provide a lot of optional diagnostic information and again have to be explicitly enabled.
Application logs are written by application developers. These can be sent to a number of different places, but a common destination is Application Insights. If logs are published into Application Insights they can be queried interactively, and used as part of alerts and dashboards. We'll discuss these in more detail in later parts of this series.
Azure Log Analytics is a central log consolidation, aggregation, and querying service. Some of the above logs are published automatically into Log Analytics, while others have to be configured to do so. Log Analytics isn't a free service, and needs to be provisioned separately if you want to configure logs to be sent into it. We'll discuss it in more detail throughout this series.
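As a rough illustration of the shape described at the start of this section (a category, a level, a message, and some contextual data), here is a small Python sketch of a log record and a filter by level. The level names, their ordering, and the sample records are assumptions made for the example; each Azure service and logging library defines its own.

```python
from dataclasses import dataclass, field

# Assumed ordering of levels, lowest to highest. Real services use their own names.
LEVELS = ["Verbose", "Information", "Warning", "Error"]


@dataclass(frozen=True)
class LogRecord:
    """A structured log entry: category, level, message, plus free-form context."""
    category: str
    level: str
    message: str
    context: dict = field(default_factory=dict)


def at_or_above(records, minimum_level):
    """Return the records whose level is at least minimum_level."""
    minimum = LEVELS.index(minimum_level)
    return [r for r in records if LEVELS.index(r.level) >= minimum]


# Hypothetical application logs.
records = [
    LogRecord("Orders", "Information", "Order received", {"orderId": "1001"}),
    LogRecord("Orders", "Warning", "Payment retry", {"orderId": "1001", "attempt": 2}),
    LogRecord("Inventory", "Error", "Stock lookup failed", {"sku": "ABC-1"}),
]

problems = at_or_above(records, "Warning")
for record in problems:
    print(record.level, record.category, record.message, record.context)
```

Because the contextual data travels with each record as structured fields rather than being buried in the message text, a querying service can filter and group on it.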
Ingestion of Telemetry
Azure services automatically publish metrics into Azure Monitor, and these built-in metrics are ingested free of charge. Custom metrics can also be ingested by Azure Monitor, which we'll discuss in more detail in part 3 of this series.
As described in the previous section, different types of logs are ingested in different ways. Azure Monitor automatically ingests resource activity logs, and does so free of charge. The other types of logs are not ingested by Azure Monitor unless you explicitly opt into that, either by configuring Application Insights to receive custom logs, or by provisioning a Log Analytics workspace and then configuring your various components to send their logs to that.
Processing and Working With Telemetry
Once data has been ingested into Azure Monitor, it becomes available for a variety of different purposes. Many of these will be discussed in later parts of this series. For example, metrics can be used for dashboards (see part 6, coming soon) and for autoscale rules (see part 8, coming soon); logs that have been routed to Azure Monitor can be used as part of alerts (see part 5); and all of the data can be exported (see part 9, coming soon).
Application Insights
Application Insights has been part of the Azure platform for around two years. Microsoft recently announced that it is considered to be part of the umbrella Azure Monitor service. However, Application Insights is deployed as a separate service, and is billable based on the amount of data it ingests. We'll cover Application Insights in more detail in part 2 of this series.
Summary of Instrumentation Components
There's a lot to take in here! The instrumentation story across Azure isn't always easy to understand, and although the complexity is reducing as Microsoft consolidates more and more of these services into Azure Monitor, there is still a lot to unpack. Here's a very brief summary:
Azure Monitor is the primary instrumentation service we generally interact with. Azure Monitor captures metrics from every Azure service, and it also captures some types of logs. More detailed diagnostic and activity logging can be enabled on a per-service or per-application basis, and depending on how you configure it, it may be routed to Azure Monitor or somewhere else like an Azure Storage account.
Custom data can be published into Azure Monitor through custom metrics (which we'll cover in part 3 of the series), through publishing custom logs into Log Analytics, and through Application Insights. Application Insights is a component that is deployed separately, and provides even more metrics and logging capabilities. It's built off the same infrastructure as the rest of Azure Monitor and is mostly queryable from the same places.
Once telemetry is published into Azure Monitor it's available for a range of different purposes including interactive querying, alerting, dashboarding, and exporting. We'll cover all of these in more detail throughout the series.
Instrumentation as Infrastructure
The idea of automating all of our infrastructure - scripting the setup of virtual machines or App Services, creating databases, applying schema updates, deploying our applications, and so forth - has become fairly uncontroversial. The benefits are so compelling, and the tools are getting so good, that generally most teams don't take much convincing that expressing their infrastructure as code is worthwhile. But in my experience working with a variety of customers, I've found that this often isn't the case with instrumentation.
Instrumentation components like dashboards, alerts, and availability tests are still frequently seen as being of a different category to the rest of an application. While it may seem perfectly reasonable to script out the creation of some compute resources, and for these scripts to be put into a version control system and built alongside the app itself, instrumentation is frequently handled manually and without the same level of automation rigour as the application code and scripts. As I'll describe below, I'm not opposed to using the Azure Portal and other similar tools to explore the metrics and logs associated with an application. But I believe that the instrumentation artifacts that come out of this exploration - saved queries, dashboard widgets, alert rules, etc - are just as important as the rest of our application components, and should be treated with the same level of diligence.
As with any other type of infrastructure, there are some clear benefits to expressing instrumentation components as code compared to using the Azure Portal, including:
Reducing risk of accidental mistakes: I find that expressing my instrumentation logic explicitly in code, scripts, or ARM templates makes me far less likely to make a typo, or to do something silly like confuse different units of measurement when I'm setting an alert threshold.
Peer review: For teams that use a peer review process in their version control system, treating infrastructure as code means that someone else on the team is expected to review the changes I'm making. If I do end up making a dumb mistake then it's almost always caught by a coworker during a review, and even if there are no mistakes, having someone else on the team review the change means that someone else understands what's going on.
Version control: Keeping all of our instrumentation logic and alert rules in a version control system is helpful when we want to understand how instrumentation has evolved over time, and for auditability.
Keeping related changes together: I'm a big fan of keeping related changes together. For example, if I create a pull request to add a new application component then I can add the application code, the deployment logic, and the instrumentation for that new component all together. This makes it easier to understand the end-to-end scope of the feature being added. If we include instrumentation in our 'definition of done' for a feature then we can easily see that this requirement is met during the code review stage.
Managing multiple environments: When instrumentation rules and components aren't automated, it's easy for them to get out of sync between environments. In most applications there is at least one dev/test environment as well as production. While it might seem unnecessary to have alerts and monitoring in a dev environment, I will argue in part 2 of this series that it's important to do so, even if you have slightly different rules and thresholds. Deploying instrumentation as code means that these environments can be kept in sync. Similarly, you may deploy your production environment to multiple regions for georedundancy or for performance reasons. If your instrumentation components are kept alongside the rest of your infrastructure, you'll get the same alerts and monitoring for all of your regions.
Avoid partial automation: In my experience, partially automating an application can sometimes result in more complexity than not automating it at all. For example, if you use ARM templates and (as I typically suggest) use the 'complete' deployment mode, then any components you may have created manually through the Azure Portal can be removed. Many of the instrumentation components we'll discuss are ARM resources and so can be subject to this behaviour. Therefore, a lack of consistency across how we deploy all of our infrastructure and instrumentation can result in lost work, missed alerts, hard-to-find bugs, and generally odd instrumentation behaviour.
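The last point is easiest to see with a toy model. The Python sketch below treats a resource group as a set of resource names and mimics the two ARM deployment modes: incremental leaves alone anything the template doesn't mention, while complete removes it. The function and the resource names are hypothetical, and the model deliberately ignores the finer details of how ARM decides what to delete.

```python
def deploy(existing, template, mode):
    """Return the resource names present in the resource group after deploying."""
    if mode == "incremental":
        # Resources not mentioned in the template are left untouched.
        return set(existing) | set(template)
    if mode == "complete":
        # Resources not mentioned in the template are removed.
        return set(template)
    raise ValueError(f"unknown deployment mode: {mode}")


# What is in the resource group now, including an alert created by hand in the portal.
existing = {"app-service", "sql-database", "cpu-alert-created-in-portal"}

# What the template in version control describes.
template = {"app-service", "sql-database", "queue-length-alert"}

after_incremental = deploy(existing, template, "incremental")
after_complete = deploy(existing, template, "complete")

removed = sorted(existing - after_complete)
print(removed)
```

In this model the hand-made alert survives an incremental deployment but disappears under complete mode, which is exactly the sort of missed alert that partial automation can cause.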
Using the Azure Portal
Having an instrumentation-first mindset doesn't mean that we can't or shouldn't ever use the Azure Portal. In fact, I tend to use it quite a lot - but for specific purposes.
First, I tend to use it a lot for interactively querying metrics and logs in response to an issue, or just to understand how my systems are behaving. I'll use Metrics Explorer to create and view charts of potentially interesting metrics, and I'll write log queries and execute them from Application Insights or Log Analytics.
Second, when I'm responding to alerts, I'll make use of the portal's tooling to view details, track the status of the alert, and investigate what might be happening. We'll discuss alerts more later in this series.
Third, I use the portal for monitoring my dashboards. We'll talk about dashboards in part 6 (coming soon). Once they're created, I'll often check on them to make sure that all of my metrics look to be in a normal range and that everything appears healthy.
Fourth, when I'm developing new alerts, dashboard widgets, or other components, I'll create test resources using the portal. I'll use my existing automation scripts to deploy a short-term copy of my environment temporarily, then deploy a new alert or autoscale rule using the portal, and then export them to an ARM template or manually construct a template based on what gets deployed by the portal. This way I can see how things should work, and get to use the portal's built-in validation and assistance with creating the components, but still get everything into code form eventually. Many of the ARM templates I'll provide throughout this series were created in this way.
Finally, during an emergency - when a system is down, or something goes wrong in the middle of the night - I'll sometimes drop the automation-first requirement and create alerts on the fly, even on production, but knowing that I'll need to make sure I add it into the automation scripts as soon as possible to ensure everything stays in sync.
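To give a feel for what "getting everything into code form" can look like, here is a hedged Python sketch that builds a small ARM template containing one metric alert, with a different threshold for each environment. The environment names, thresholds, alert name, and resource IDs are hypothetical placeholders, and the property names reflect my understanding of the `Microsoft.Insights/metricAlerts` resource; check them against the current ARM reference before relying on them.

```python
import json

# Hypothetical per-environment settings. The unit is part of the key name so that
# a threshold is less likely to be entered in the wrong unit of measurement.
ENVIRONMENTS = {
    "dev": {"cpu_threshold_percent": 95, "severity": 3},
    "prod": {"cpu_threshold_percent": 80, "severity": 1},
}


def build_cpu_alert_template(environment, app_service_plan_id, action_group_id):
    """Build an ARM template (as a dict) with one CPU metric alert for the environment."""
    settings = ENVIRONMENTS[environment]
    alert = {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": f"high-cpu-{environment}",
        "location": "global",
        "properties": {
            "description": "Average CPU on the App Service plan is too high",
            "severity": settings["severity"],
            "enabled": True,
            "scopes": [app_service_plan_id],
            "evaluationFrequency": "PT1M",  # how often the rule is evaluated
            "windowSize": "PT5M",           # how much data each evaluation looks at
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                "allOf": [{
                    "name": "HighCpu",
                    "metricName": "CpuPercentage",
                    "operator": "GreaterThan",
                    "threshold": settings["cpu_threshold_percent"],
                    "timeAggregation": "Average",
                }],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [alert],
    }


# Placeholder resource IDs. Real ones include a subscription and resource group.
template = build_cpu_alert_template(
    "prod",
    "<app-service-plan-resource-id>",
    "<action-group-resource-id>",
)
print(json.dumps(template, indent=2))
```

Generating the template from one function means dev and production get the same alert definition and differ only in the values listed in `ENVIRONMENTS`. Later parts of the series express the same idea directly with ARM template parameters.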
Summary
This post has outlined the basics of Azure's instrumentation platform. The two main types of data we tend to work with are metrics and logs. Metrics are numerical values that represent the state of a system at a particular point in time. Logs come in several variants, some of which are published automatically and some of which need to be enabled and then published to a suitable location before they can be queried. Both metrics and logs can be processed by Azure Monitor, and over the course of this series we'll look at how we can script and automate the ingestion, processing, and handling of a variety of types of instrumentation data.
Automation of Azure Monitor and other instrumentation components is something that I've found to be quite poorly documented, so in writing this series I've aimed to provide both explanations of how these parts can be built, and a set of sample ARM templates and scripts that you can adapt to your own environment.
In the next part we'll discuss Application Insights, and some of the automation we can achieve with that. We'll also look at a pattern I typically use for deploying different levels of instrumentation into different environments.