John Downs

Building Human-Focused Software

Automating Azure Instrumentation and Monitoring – Part 5: Log Alerts

This post was originally published on the Kloud blog.

In the previous part of this series, we looked at the basic structure of Azure Monitor alerts, and then specifically at metric alerts. In this part we will consider other types of alert that Azure Monitor can emit. We will first discuss application log alerts - sometimes simply called log alerts - which let us be notified about important data emitted into our application logs. Next we will discuss activity log alerts, which notify us when events happen within Azure itself. These include service health alerts, which Azure emits when there are issues with a service.

This post is part of a series:

  • Part 1 provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an 'infrastructure as code' mindset for our instrumentation and monitoring components.

  • Part 2 describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.

  • Part 3 discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.

  • Part 4 covers the basics of alerts and metric alerts. Azure Monitor's powerful alerting system is a big topic, and in this part we'll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.

  • Part 5 (this post) covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts notify us when Azure itself is having an issue that may result in downtime or degraded performance.

  • Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I'll show a few tips and tricks I've learned when doing this.

  • Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We'll discuss deploying and automating both single-step (ping) and multi-step availability tests.

  • Part 8 (coming soon) describes autoscale. While this isn't exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.

  • Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.

Application Log Alerts

Azure supports application-level logging into two main destinations: Application Insights (which we discussed in part 2 of this series) and Azure Monitor's own log management system. Both of these services receive log entries, store and index them within a few minutes, and allow for interactive querying through a powerful syntax called KQL. Additionally, we can create scheduled log query alert rules that run on the log data.

Note: Microsoft recently announced that they have renamed the service previously known as Log Analytics to Azure Monitor logs. This is a sensible change, in my opinion, since it reflects the fact that logs are just another piece of data within Azure Monitor.

Scheduled log query alert rules are relatively simple: at a frequency that we specify, they run a defined query and then look at the result. If the result matches criteria that we have specified then an alert is fired.
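To make that evaluation loop concrete, here is a minimal Python sketch of a single scheduled run. It is only a model of the behaviour described above, not how Azure Monitor is implemented: the ScheduledQueryRule class, the should_fire function, and the sample log entries are all hypothetical, and a plain Python function stands in for the KQL query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List

LogEntry = Dict[str, object]


@dataclass
class ScheduledQueryRule:
    """Hypothetical in-memory stand-in for a scheduled log query alert rule."""
    query: Callable[[List[LogEntry]], List[LogEntry]]  # plays the role of the KQL query
    time_window: timedelta  # how far back each run looks
    threshold: int          # compared against the number of results


def should_fire(rule: ScheduledQueryRule, logs: List[LogEntry], now: datetime) -> bool:
    """One scheduled run: keep entries in the window, run the query, compare the count."""
    window_start = now - rule.time_window
    in_window = [entry for entry in logs if window_start <= entry["timestamp"] <= now]
    return len(rule.query(in_window)) > rule.threshold


now = datetime(2019, 1, 1, 12, 0)
logs = [
    {"timestamp": now - timedelta(minutes=2), "level": "Error"},
    {"timestamp": now - timedelta(minutes=30), "level": "Error"},  # outside the window
    {"timestamp": now - timedelta(minutes=1), "level": "Information"},
]
rule = ScheduledQueryRule(
    query=lambda entries: [e for e in entries if e["level"] == "Error"],
    time_window=timedelta(minutes=5),
    threshold=0,
)

print(should_fire(rule, logs, now))
```

With these sample entries, one error falls inside the five-minute window, so the result count exceeds the threshold of zero and the run fires an alert.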

Like metric alert rules, scheduled log alert rules specify the conditions under which an alert should fire, but they don't specify the process by which a human or system should be notified. Action groups, described in detail in part 4 of this series, fill that role.

Important: Like metric alerts, log alerts cost money - and there is no free quota provided for log alerts currently. Be aware of this when you create test alert rules!

There are several key pieces of a scheduled log query alert rule:

  • Data source is a reference to the location that stores the logs, such as an Application Insights instance. This is similar to the scopes property in metric alerts. Note that in ARM templates for log alerts, we specify the data source twice - once as a tag, and once as a property of the alert rule resource. Also, note that we can perform cross-workspace queries (for example, joining data from Application Insights with data in Azure Monitor logs); when we do so, we need to specify the full list of data sources we're querying in the authorizedResources property.

  • Query is a KQL query that should be executed on a regular basis.

  • Query type indicates how the results of the query should be interpreted. For example, if we want to count the number of results and compare it to a threshold, we would use the ResultCount query type.

  • Trigger specifies the critical threshold for the query results. This includes both a threshold value and a comparison operator.

  • Schedule specifies how frequently the log query should run, and the time window that the log query should consider. Note that more frequent executions result in a higher cost.

  • Severity indicates how important the alert rule is. As described in part 4 of this series, this helps whoever responds to an alert from this rule to prioritise it.

  • Actions are the action groups that should be invoked when an alert is fired.

  • Metadata includes a name and description of the alert rule.

Full documentation is available within the ARM API documentation site. There are some idiosyncrasies with these ARM templates, including the use of a mandatory tag and the fact that the enabled property is a string rather than a boolean value, so I suggest copying a known working example and modifying it incrementally.


I recently worked on an application that had intermittent errors being logged into Application Insights. We quickly realised that the errors were logged by Windows Communication Foundation within our application. These errors indicated a problem that the development team needed to address, so in order to monitor the situation we configured an alert rule as follows:

  • Data source was the Application Insights instance for the application.

  • Query was the following KQL string: exceptions | where type matches regex 'System.ServiceModel.*'. This looked for all data within the exceptions index that contained a type field with the term System.ServiceModel inside it, using a regular expression to perform the match. (KQL queries can be significantly more complex than this, if you need them to be!)

  • Query type was ResultCount, since we were interested in monitoring the number of log entries matching the query.

  • Trigger was set to greater than and 0 for the operator and threshold, respectively.

  • Schedule was set to evaluate the query every five minutes, and to look back at the last five minutes, which meant that we had round-the-clock monitoring on this rule.

  • Severity was 3, since we considered this to be a warning-level event but not an immediate emergency.

  • Action was set to an action group that sent an email to the development team.

The following ARM template creates this alert rule using an Application Insights instance:

Note that this ARM template also creates an action group with an email action, but of course you can have whatever action groups you want; you can also refer to shared action groups in other resource groups.

Of course, if you have data within Azure Monitor logs (previously Log Analytics workspaces) then the same process applies there, but with a different data source.
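As a rough illustration of how the pieces of the example rule fit together, the Python sketch below builds the rule as a dictionary and prints it as JSON, which is the shape an ARM template resource takes. The property names, the 2018-04-16 API version, and the long odata.type value reflect my understanding of the Microsoft.Insights/scheduledQueryRules schema at the time of writing and should be checked against the ARM API documentation. The build_log_alert_rule function, the rule name, the location, and the resource IDs are placeholders.

```python
import json

# Assumed discriminator value for an alerting action; verify against the ARM API docs.
ALERTING_ACTION_TYPE = (
    "Microsoft.WindowsAzure.Management.Monitoring.Alerts.Models."
    "Microsoft.AppInsights.Nexus.DataContracts.Resources.ScheduledQueryRules.AlertingAction"
)


def build_log_alert_rule(name: str, location: str, data_source_id: str, action_group_id: str) -> dict:
    """Build a scheduled log query alert rule resource as a plain dictionary."""
    return {
        "type": "Microsoft.Insights/scheduledQueryRules",
        "apiVersion": "2018-04-16",
        "name": name,
        "location": location,
        # The data source is specified twice: once here as the mandatory tag...
        "tags": {f"hidden-link:{data_source_id}": "Resource"},
        "properties": {
            "description": "WCF exceptions were logged in the last five minutes.",
            "enabled": "true",  # a string, not a boolean
            "source": {
                "query": "exceptions | where type matches regex 'System.ServiceModel.*'",
                "dataSourceId": data_source_id,  # ...and once here as a property
                "queryType": "ResultCount",
            },
            "schedule": {"frequencyInMinutes": 5, "timeWindowInMinutes": 5},
            "action": {
                "odata.type": ALERTING_ACTION_TYPE,
                "severity": "3",
                "aznsAction": {"actionGroup": [action_group_id]},
                "trigger": {"thresholdOperator": "GreaterThan", "threshold": 0},
            },
        },
    }


# Placeholder resource IDs, for illustration only.
app_insights_id = "/subscriptions/<subscription-id>/resourceGroups/MyApplication/providers/Microsoft.Insights/components/MyAppInsights"
action_group_id = "/subscriptions/<subscription-id>/resourceGroups/MyApplication/providers/Microsoft.Insights/actionGroups/EmailDevelopmentTeam"

rule = build_log_alert_rule("WcfExceptions", "australiaeast", app_insights_id, action_group_id)
print(json.dumps(rule, indent=2))
```

For data in an Azure Monitor logs workspace, the same function would be called with the workspace's resource ID as the data source.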

Metric Log Alerts

There is also a special scenario available: when certain log data gets ingested into Azure Monitor logs workspaces, it is made available for metric alerting. These alerts are for data including performance counters from virtual machines and certain other types of well-known log data. In these cases, logs are used to transmit the data but it is fundamentally a metric, so this feature of Azure Monitor exposes it as such. More information on this feature, including an example ARM template, is available here.

Activity Log Alerts

Azure's activity log is populated by Azure automatically. It includes a number of different types of data, including resource-level operations (e.g. resource creation, modification, and deletion), service health data (e.g. when a maintenance event is planned for a virtual machine), and a variety of other types of log data that can be specific to individual resource types. More detail about the data captured by the activity log is available on the Azure documentation pages.

Activity log alert rules can be created through ARM templates using the resource type Microsoft.Insights/activityLogAlerts. The resource properties are similar to those on metric alert rules, but with a few differences. Here are the major properties:

  • Scope is the resource that we want to monitor the activity log entries from, and alert on. Note that we can provide a resource group or even a subscription reference here, and all of the resources within those scopes will be covered.

  • Condition is the set of specific rules that should be evaluated. These are a set of boolean rules to evaluate log entries' categories, resource types, and other key properties. Importantly, you must always provide a category filter.

  • Actions are references to the action group (or groups) that should be invoked when an alert is fired.

Unlike application log queries, we don't specify a KQL query or a time window; we instead have a simpler set of boolean criteria to use to filter events and get alerts.
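The boolean criteria can be pictured as a list of field and value pairs that must all match an event. The Python sketch below models that idea only. The condition_matches function and the flattened sample events are hypothetical, and real activity log entries have a richer, nested structure than these dictionaries.

```python
from typing import Dict, List

Condition = Dict[str, List[Dict[str, str]]]


def condition_matches(condition: Condition, event: Dict[str, str]) -> bool:
    """Return True only when every field/equals pair in allOf matches the event."""
    return all(event.get(rule["field"]) == rule["equals"] for rule in condition["allOf"])


# A condition that always includes the mandatory category filter.
resource_group_deleted: Condition = {
    "allOf": [
        {"field": "category", "equals": "Administrative"},
        {"field": "operationName", "equals": "Microsoft.Resources/subscriptions/resourceGroups/delete"},
    ]
}

# Hypothetical, simplified activity log events.
delete_event = {
    "category": "Administrative",
    "operationName": "Microsoft.Resources/subscriptions/resourceGroups/delete",
}
health_event = {"category": "ServiceHealth"}

print(condition_matches(resource_group_deleted, delete_event))
print(condition_matches(resource_group_deleted, health_event))
```

Only the deletion event satisfies both filters; the service health event fails the category filter, so it would not raise an alert from this rule.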

By using activity log alerts, we can set up rules like 'alert me whenever a resource group is deleted within this Azure subscription' and 'alert me whenever a failed ARM template deployment happens within this resource group'. Here is an example ARM template that covers both of these scenarios:

Note that activity log alert resources should be created in the global location.
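As a sketch of the shape these two rules take, the Python below builds each one as a dictionary and prints the pair as JSON. The property names and the 2017-04-01 API version are my understanding of the Microsoft.Insights/activityLogAlerts schema and should be verified against the ARM documentation. The activity_log_alert helper, the rule names, and the resource IDs are placeholders.

```python
import json
from typing import Dict, List


def activity_log_alert(name: str, scope: str, filters: Dict[str, str], action_group_id: str) -> dict:
    """Build an activity log alert rule resource from simple field/value filters."""
    all_of: List[Dict[str, str]] = [{"field": f, "equals": v} for f, v in filters.items()]
    return {
        "type": "Microsoft.Insights/activityLogAlerts",
        "apiVersion": "2017-04-01",
        "name": name,
        "location": "Global",  # activity log alerts live in the global location
        "properties": {
            "enabled": True,
            "scopes": [scope],
            "condition": {"allOf": all_of},
            "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
        },
    }


# Placeholder IDs, for illustration only.
subscription_id = "/subscriptions/<subscription-id>"
resource_group_id = subscription_id + "/resourceGroups/MyApplication"
action_group_id = resource_group_id + "/providers/Microsoft.Insights/actionGroups/EmailDevelopmentTeam"

rules = [
    # Alert whenever a resource group is deleted within the subscription.
    activity_log_alert(
        "ResourceGroupDeleted",
        subscription_id,
        {"category": "Administrative",
         "operationName": "Microsoft.Resources/subscriptions/resourceGroups/delete"},
        action_group_id,
    ),
    # Alert whenever an ARM template deployment fails within the resource group.
    activity_log_alert(
        "DeploymentFailed",
        resource_group_id,
        {"category": "Administrative",
         "operationName": "Microsoft.Resources/deployments/write",
         "status": "Failed"},
        action_group_id,
    ),
]

print(json.dumps(rules, indent=2))
```

The first rule is scoped to the whole subscription and the second to a single resource group, which is the only structural difference apart from the filters.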

More information on activity log alerts is available here; more detail on the ARM template syntax and other ways of automating their creation is available here; and even more detail on the ARM resource properties is available here.

Service Health Alerts

Azure provides service health events to advise of expected as well as unexpected issues with Azure services. For example, when virtual machines have a maintenance window scheduled, Azure publishes a service health event to notify you of this fact. Similarly, if Azure had a problem with a particular service (e.g. Azure Storage), it would typically publish a service health event to advise of the incident details, often both during the incident and after the incident has been resolved.

Service health events are published into the activity log. A great deal of information is available about these events, and as a result, activity log alert rules can be used to monitor for service health events as well. Simply use the ServiceHealth category, and then the properties available on service health events, to filter them as appropriate. An example ARM template is available within the Microsoft documentation for service health alerts.

Resource Health Alerts

Azure also helps to filter the relevant service health events into another category of activity log event, using the ResourceHealth category. While service health events provide information about planned maintenance and incidents that may affect entire Azure services, resource health events are specific to your particular resource. They essentially filter and collapse service health events into a single health status for a given resource. Once again, Microsoft provide an example ARM template within their documentation.


In this post we have discussed the different types of log alerts that Azure Monitor provides. Scheduled log query alert rules let us define queries that we should run on the structured, semi-structured, or unstructured logs that our applications emit, and then have these queries run automatically and alert us when their results show particular signals that we need to pay attention to. Activity log alert rules let us monitor data that is emitted by Azure itself, including by Azure Resource Manager and by Azure's service health monitoring systems.

We have now discussed the key components of Azure Monitor's alerting system. The alerting system works across four main resource types: action groups, and then the different types of alert rules (metric, scheduled log query, and activity log). By using all of these components together, we can create robust monitoring solutions that make use of data emitted by Azure automatically, and by custom log data and metrics that we report into Azure Monitor ourselves.

In the next post of this series we will discuss another important aspect of interacting with data that has been sent into Azure Monitor: viewing and manipulating data using dashboards.

Automating Azure Instrumentation and Monitoring - Part 4: Metric Alerts

This post was originally published on the Kloud blog.

One of the most important features of Azure Monitor is its ability to send alerts when something interesting happens - in other words, when our telemetry meets some criteria we have told Azure Monitor that we're interested in. We might have alerts that indicate when our application is down, or when it's getting an unusually high amount of traffic, or when the response time or other performance metrics aren't within the normal range. We can also have alerts based on the contents of log messages, and on the health status of Azure resources as reported by Azure itself. In this post, we'll look at how alerts work within Azure Monitor and will see how these can be automated using ARM templates. This post will focus on the general workings of the alerts system, including action groups, and on metric alerts; part 5 (coming soon) will look at log alerts and resource health alerts.

This post is part of a series:

  • Part 1 provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an 'infrastructure as code' mindset for our instrumentation and monitoring components.

  • Part 2 describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.

  • Part 3 discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.

  • Part 4 (this post) covers the basics of alerts and metric alerts. Azure Monitor's powerful alerting system is a big topic, and in this part we'll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.

  • Part 5 covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts notify us when Azure itself is having an issue that may result in downtime or degraded performance.

  • Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I'll show a few tips and tricks I've learned when doing this.

  • Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We'll discuss deploying and automating both single-step (ping) and multi-step availability tests.

  • Part 8 (coming soon) describes autoscale. While this isn't exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.

  • Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.

What Are Alerts?

Alerts are described in detail on the Azure Monitor documentation, and I won't re-hash the entire page here. Here is a quick summary, though.

An alert rule defines the situations under which an alert should fire. For example, an alert rule might be something like 'when the average CPU utilisation goes above 80% over the last hour', or 'when the number of requests that get responses with an HTTP 5xx error code goes above 3 in the last 15 minutes'. An alert is a single instance in which the alert rule fired. We tell Azure Monitor what alert rules we want to create, and Azure Monitor creates alerts and sends them out.

Alert rules have three logical components:

  • Target resource: the Azure resource that should be monitored for this alert. For example, this might be an app service, a Cosmos DB account, or an Application Insights instance.

  • Rule: the rule that should be applied when determining whether to fire an alert for the resource. For example, this might be a rule like 'when average CPU usage is greater than 50% within the last 5 minutes', or 'when a log message is written with a level of Warning'. Rules include a number of sub-properties, and often include a time window or schedule that should be used to evaluate the alert rule.

  • Action: the actions that should be performed when the alert has fired. For example, this might be 'send an email to a given address' or 'invoke a webhook at a given URL'. Azure Monitor provides a number of action types that can be invoked, which we'll discuss below.

There are also other pieces of metadata that we can set when we create alert rules, including the alert rule name, description, and severity. Severity is a useful piece of metadata that will be propagated to any alerts that fire from this alert rule, and allows for whoever is responding to understand how important the alert is likely to be, and to prioritise their list of alerts so that they deal with the most important alerts first.

Classic Alerts

Azure Monitor currently has two types of alerts. Classic alerts are the original alert type supported by Azure Monitor since its inception, and can be contrasted with the newer alerts - which, confusingly, don't seem to have a name, but which I'll refer to as newer alerts for the sake of this post.

There are many differences between classic and newer alerts. One such difference is that in classic alerts, actions and rules are mixed into a single 'alert' resource, while in newer alerts, actions and rules are separate resources (as described below in more detail). A second difference is that as Azure migrates from classic to newer alerts, some Azure resource types only support classic alerts, although these are all being migrated across to newer alerts.

Microsoft recently announced that classic alerts will be retired in June 2019, so I won't spend a lot of time discussing them here, although if you need to create a classic alert with an ARM template before June 2019, you can use this documentation page as a reference.

All of the rest of this discussion will focus on newer alerts.

Alert Action Groups

A key component of Azure Monitor's alert system is action groups, which define how an alert should be handled. Importantly, action groups are independent of the alert rule that triggered them. An alert rule defines when and why an alert should be fired, while an action group defines how the alert should be sent out to interested parties. For example, an action group can send an email to a specified email address, send an SMS notification, invoke a webhook, trigger a Logic App, or perform a number of other actions. A single action group can perform one or several of these actions.

Action groups are Azure Resource Manager resources in their own right, and alert rules then refer to them. This means we can have shared action groups that work across multiple alerts, potentially spread across multiple applications or multiple teams. We can also create specific action groups for defined purposes. For example, in an enterprise application you might have a set of action groups like this:

| Action Group Name | Resource Group | Actions | Notes |
| --- | --- | --- | --- |
| CreateEnterpriseIssue | Shared-OpsTeam | Invoke a webhook to create issue in enterprise issue tracking system. | This might be used for high priority issues that need immediate, 24x7 attention. It will notify your organisation's central operations team. |
| SendSmsToTeamLead | MyApplication | Send an SMS to the development team lead. | This might be used for high priority issues that also need 24x7 attention. It will notify the dev team lead. |
| EmailDevelopmentTeam | MyApplication | Send an email to the development team's shared email alias. | This might be used to ensure the development team is aware of all production issues, including lower-priority issues that only need attention during business hours. |

Of course, these are just examples; you can set up any action groups that make sense for your application, team, or company.

Automating Action Group Creation

Action groups can be created and updated using ARM templates, using the Microsoft.Insights/actionGroups resource type. The schema is fairly straightforward, but one point to consider is the groupShortName property. The short name is used in several places throughout Azure Monitor, but importantly it is used to identify the action group on email and SMS message alerts that Azure Monitor sends. If you have multiple teams, multiple applications, or even just multiple alert groups, it's important to choose a meaningful short name that will make sense to someone reading the alert. I find it helpful to put myself in the mind of the person (likely me!) who will be woken at 3am to a terse SMS informing them that something has happened; they will be half asleep while trying to make sense of the alert that they have received. Choosing an appropriate action group short name may help save them several minutes of troubleshooting time, reducing the time to diagnosis (and the time before they can return to bed). Unfortunately these short names must be 12 characters or fewer, so it's not always easy to find a good name to use.

With this in mind, here is an example ARM template that creates the three action groups listed above:

Note that this will create all three action groups in the same resource group, rather than using separate resource groups for the shared and application-specific action groups.
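As a rough sketch of what these three resources look like, the Python below builds them as dictionaries and prints them as JSON, rejecting any short name longer than 12 characters. The property names and the 2018-03-01 API version are my understanding of the Microsoft.Insights/actionGroups schema and are worth checking against the documentation. The action_group helper, the short names, the webhook URL, the phone number, and the email address are all made up for illustration.

```python
import json
from typing import List, Optional

MAX_SHORT_NAME_LENGTH = 12  # the limit on action group short names


def action_group(name: str, short_name: str, *,
                 emails: Optional[List[dict]] = None,
                 sms: Optional[List[dict]] = None,
                 webhooks: Optional[List[dict]] = None) -> dict:
    """Build an action group resource, checking the short name length first."""
    if len(short_name) > MAX_SHORT_NAME_LENGTH:
        raise ValueError(f"Short name '{short_name}' is longer than {MAX_SHORT_NAME_LENGTH} characters")
    return {
        "type": "Microsoft.Insights/actionGroups",
        "apiVersion": "2018-03-01",
        "name": name,
        "location": "Global",
        "properties": {
            "groupShortName": short_name,
            "enabled": True,
            "emailReceivers": emails or [],
            "smsReceivers": sms or [],
            "webhookReceivers": webhooks or [],
        },
    }


# All receiver details below are hypothetical.
resources = [
    action_group("CreateEnterpriseIssue", "EntIssue",
                 webhooks=[{"name": "IssueTracker", "serviceUri": "https://example.com/issues"}]),
    action_group("SendSmsToTeamLead", "SmsTeamLead",
                 sms=[{"name": "TeamLead", "countryCode": "1", "phoneNumber": "5555555555"}]),
    action_group("EmailDevelopmentTeam", "EmailDevTeam",
                 emails=[{"name": "DevTeam", "emailAddress": "devteam@example.com"}]),
]

print(json.dumps(resources, indent=2))
```

Validating the short name before deployment means a name that is too long is caught locally rather than as a failed deployment.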

Once the action groups have been created, any SMS and email recipients will receive a confirmation message to let them know they are now in the action group. They can also unsubscribe from the action group if they choose. If you use a group email alias, it's important to remember that if one recipient unsubscribes then the whole email address action will be disabled for that alert, and nobody on the email distribution list will get those alerts anymore.

Metric Alerts

Now that we know how to create action groups that are ready to receive alerts and route them to the relevant people and places, let's look at how we create an alert based on the metrics that Azure Monitor has recorded for our system.

Important: Metric alerts are not free of charge, although there is a small free quota you get. Make sure you remove any test alert rules once you're done, and take a look at the pricing information for more detail.

A metric alert rule has a number of important properties:

  • Scope is the resource that has the metrics that we want to monitor and alert on.

  • Evaluation frequency is how often Azure Monitor should check the resource to see if it meets the criteria. This is specified as an ISO 8601 period - for example, PT5M means 'check this alert every 5 minutes'.

  • Window size is how far back in time Azure Monitor should look when it checks the criteria. This is also specified as an ISO 8601 period - for example, PT1H means 'when running this alert, look at the metric history for the last 1 hour'. This can be between 5 minutes and 24 hours.

  • Criteria are the specific rules that should be evaluated. There is a sophisticated set of functionality available when specifying criteria, but commonly this will be something like (for an App Service) 'look at the number of requests that resulted in a 5xx status code response, and alert me if the count is greater than 3', or (for a Cosmos DB database) 'look at the number of requests where the StatusCode dimension was set to the value 429 (representing a throttled request), and alert me if the count is greater than 1'.

  • Actions are references to the action group (or groups) that should be invoked when an alert is fired.
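The ISO 8601 periods used for evaluation frequency and window size can be checked before deployment. The Python sketch below handles only the hours-and-minutes subset used in the examples above (values such as PT5M and PT1H), not the full ISO 8601 duration grammar. The parse_period and is_valid_window_size names are my own, and the 5 minute to 24 hour range is the one mentioned in the list above.

```python
import re
from datetime import timedelta

# Matches only the subset of ISO 8601 durations written as PT<hours>H<minutes>M.
_PERIOD_PATTERN = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def parse_period(value: str) -> timedelta:
    """Convert a period such as 'PT5M' or 'PT1H30M' into a timedelta."""
    match = _PERIOD_PATTERN.match(value)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unsupported period: {value!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return timedelta(hours=hours, minutes=minutes)


def is_valid_window_size(value: str) -> bool:
    """Check that a window size falls between 5 minutes and 24 hours."""
    return timedelta(minutes=5) <= parse_period(value) <= timedelta(hours=24)


print(parse_period("PT5M"))
print(parse_period("PT1H"))
print(is_valid_window_size("PT1M"))
```

A one-minute period parses successfully but fails the window size check, because it is shorter than the five-minute minimum.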

Each of these properties can be set within an ARM template using the resource type Microsoft.Insights/metricAlerts. Let's discuss a few of these in more detail.


As we know from earlier in this series, there are three main ways that metrics get into Azure Monitor:

  • Built-in metrics, which are published by Azure itself.

  • Custom resource metrics, which are published by our applications and are attached to Azure resources.

  • Application Insights allows for custom metrics that are also published by our applications, but are maintained within Application Insights rather than tied to a specific Azure resource.

All three of these metric types can have alerts triggered from them. In the case of built-in and custom resource metrics, we will use the Azure resource itself as the scope of the metric alert. For Application Insights, we use the Application Insights resource (i.e. the resource of type Microsoft.Insights/components) as the scope.

Note that Microsoft has recently announced a preview capability of monitoring multiple resources in a single metric alert rule. This currently only works with virtual machines, and as it's such a narrow use case, I won't discuss it here. However, keep in mind that the scopes property is specified as an array because of this feature.


A criterion is a specification of the conditions under which the alert should fire. Criteria have the following sub-properties:

  • Name: a criterion can have a friendly name specified to help understand what caused an alert to fire.

  • Metric name and namespace: the name of the metric that was published, and if it's a custom metric, the namespace. For more information on metric namespaces see part 3 of this series. A list of built-in metrics published by Azure services is available here.

  • Dimensions: if the metric has dimensions associated with it, we can filter the metrics to only consider certain dimension values. Dimension values can be included or excluded.

  • Time aggregation: the way in which the metric should be aggregated - e.g. counted, summed, or have the maximum/minimum values considered.

  • Operator: the comparison operator (e.g. greater than, less than) that should be used when comparing the aggregated metric value to the threshold.

  • Threshold: the critical value at which the aggregated metric should trigger the alert to fire.

These properties can be quite abstract, so let's consider a couple of examples.

First, let's consider an example for Cosmos DB. We might have a business rule that says 'whenever we see more than one throttled request, fire an alert'. In this example:

  • Metric name would be TotalRequests, since that is the name of the metric published by Cosmos DB. There is no namespace since this is a built-in metric. Note that, by default, TotalRequests is the count of all requests and not just throttled requests, so...

  • Dimension would be set to filter the StatusCode dimension to only include the value 429, since 429 represents a throttled request.

  • Operator would be GreaterThan, since we are interested in knowing when we see more than a single throttled request.

  • Threshold would be 1, since we want to know whether we received more than one throttled request.

  • Time aggregation would be Maximum. The TotalRequests metric is a count-based metric (i.e. each metric raw value represents the total number of requests for a given period of time), and so we want to look at the maximum value of the metric within the time window that we are considering.

Second, let's consider an example for App Services. We might have a business rule that says 'whenever our application returns more than three responses with a 5xx response code, fire an alert'. In this example:

  • Metric name would be Http5xx, since that is the name of the metric published by App Services. Once again, there is no namespace.

  • Dimension would be omitted. App Services publishes the Http5xx metric as a separate metric rather than having a TotalRequests metric with dimensions for status codes like Cosmos DB. (Yes, this is inconsistent!)

  • Operator would again be GreaterThan.

  • Threshold would be 3.

  • Time aggregation would again be Maximum.
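To see how time aggregation, operator, and threshold combine, here is a small Python model of evaluating one criterion against the metric values in a time window. It is only an illustration of the logic described above, not Azure Monitor's implementation. The criterion_met function and the sample values are hypothetical, while the aggregation and operator names mirror the values used in metric alert rules.

```python
import operator
from typing import Callable, Dict, List

# Ways of collapsing the metric values in the window into a single number.
AGGREGATIONS: Dict[str, Callable[[List[float]], float]] = {
    "Maximum": max,
    "Minimum": min,
    "Total": sum,
    "Average": lambda values: sum(values) / len(values),
    "Count": len,
}

# Comparison between the aggregated value and the threshold.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "GreaterThan": operator.gt,
    "GreaterThanOrEqual": operator.ge,
    "LessThan": operator.lt,
    "LessThanOrEqual": operator.le,
    "Equals": operator.eq,
}


def criterion_met(values: List[float], time_aggregation: str, op: str, threshold: float) -> bool:
    """Aggregate the values in the window, then compare the result to the threshold."""
    if not values:
        return False  # no data in the window, so nothing to alert on in this model
    aggregated = AGGREGATIONS[time_aggregation](values)
    return OPERATORS[op](aggregated, threshold)


# Hypothetical per-minute Http5xx counts within one window.
print(criterion_met([0, 1, 4, 0], "Maximum", "GreaterThan", 3))
print(criterion_met([0, 2, 3, 0], "Maximum", "GreaterThan", 3))
```

In the first window the maximum is 4, which is greater than 3, so the criterion is met; in the second the maximum is exactly 3, so it is not.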

Note that a single metric alert can have one or more criteria. The odata.type property of the criteria property identifies the kind of criteria in use. As the names suggest, the distinction is about the number of resources rather than the number of criteria: Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria is for rules that target a single resource (with one or more criteria), while Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria is for rules that span multiple resources. At the time of writing, if we use multiple criteria then all of the criteria must be met for the alert rule to fire.

Static and Dynamic Thresholds

Azure Monitor recently added a new preview feature called dynamic thresholds. When we use dynamic thresholds then rather than specifying the metric thresholds ourselves, we instead let Azure Monitor watch the metric and learn its normal values, and then alert us if it notices a change. The feature is currently in preview, so I won't discuss it in a lot of detail here, but there are example ARM templates available if you want to explore this.

Example ARM Templates

Let's look at a couple of ARM templates to create the metric alert rules we discussed above. Each template also creates an action group with an email action, but of course you can have whatever action groups you want; you can also refer to shared action groups in other resource groups.

First, here is the ARM template for the Cosmos DB alert rule (lines 54-99), which uses a dimension to filter the metrics (lines 74-81) like we discussed above:

Second, here is the ARM template for the App Services alert rule (lines 77 to 112):

Note: when I tried to execute the second ARM template, I sometimes found it would fail the first time around, but re-executing it worked. This seems to just be one of those weird things with ARM templates, unfortunately.
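As a sketch of the overall shape of the Cosmos DB rule described earlier, the Python below builds the metric alert as a dictionary and prints it as JSON. The property names and the 2018-03-01 API version are my understanding of the Microsoft.Insights/metricAlerts schema and should be checked against the documentation. The cosmos_throttling_alert function, the rule and criterion names, the severity, the five-minute frequency and window, and the resource IDs are assumptions chosen for illustration.

```python
import json


def cosmos_throttling_alert(cosmos_account_id: str, action_group_id: str) -> dict:
    """Build a metric alert rule that fires on more than one throttled request."""
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": "CosmosDbThrottling",
        "location": "global",
        "properties": {
            "description": "More than one throttled request was seen.",
            "severity": 3,
            "enabled": True,
            "scopes": [cosmos_account_id],  # an array, even for a single resource
            "evaluationFrequency": "PT5M",
            "windowSize": "PT5M",
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                "allOf": [
                    {
                        "name": "ThrottledRequests",
                        "metricName": "TotalRequests",
                        # Only consider requests with a 429 (throttled) status code.
                        "dimensions": [
                            {"name": "StatusCode", "operator": "Include", "values": ["429"]}
                        ],
                        "operator": "GreaterThan",
                        "threshold": 1,
                        "timeAggregation": "Maximum",
                    }
                ],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }


# Placeholder resource IDs, for illustration only.
cosmos_id = "/subscriptions/<subscription-id>/resourceGroups/MyApplication/providers/Microsoft.DocumentDB/databaseAccounts/my-cosmos-account"
group_id = "/subscriptions/<subscription-id>/resourceGroups/MyApplication/providers/Microsoft.Insights/actionGroups/EmailDevelopmentTeam"

resource = cosmos_throttling_alert(cosmos_id, group_id)
print(json.dumps(resource, indent=2))
```

The App Services rule would follow the same shape, with Http5xx as the metric name, a threshold of 3, and no dimensions.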


Azure's built-in metrics provide a huge amount of visibility into the operation of our system components, and of course we can enrich these with our own custom metrics (see part 3 of this series). Once the data is available to Azure Monitor, Azure Monitor can alert us based on whatever criteria we want to establish. The definition of these metric alert rules is highly automatable using ARM templates, as is the definition of action groups to specify what should happen when an alert is fired.

In the next part of this series we will look at alerts based on log data.

Automating Azure Instrumentation and Monitoring - Part 3: Custom Metrics

This post was originally published on the Kloud blog.

One of the core data types that Azure Monitor works with is metrics - numerical pieces of data that represent the state of an Azure resource or of an application component at a specific point in time. Azure publishes built-in metrics for almost all Azure services, and these metrics are available for querying interactively as well as for use within alerts and other systems. In addition to the Azure-published metrics, we can also publish our own custom metrics. In this post we'll discuss how to do this using both Azure Monitor's recently announced support for custom metrics, and with Application Insights' custom metrics features. We'll start by looking at what metrics are and how they work.

This post is part of a series:

  • Part 1 provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an 'infrastructure as code' mindset for our instrumentation and monitoring components.

  • Part 2 describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.

  • Part 3 (this post) discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.

  • Part 4 covers the basics of alerts and metric alerts. Azure Monitor's powerful alerting system is a big topic, and in this part we'll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.

  • Part 5 covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts notify us when Azure itself is having an issue that may result in downtime or degraded performance.

  • Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I'll show a few tips and tricks I've learned when doing this.

  • Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We'll discuss deploying and automating both single-step (ping) and multi-step availability tests.

  • Part 8 (coming soon) describes autoscale. While this isn't exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.

  • Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.

What Are Metrics?

Metrics are pieces of numerical data. Each metric has both a value and a unit. Here are some example metrics:

Example | Value | Unit
12 requests per second | 12 | requests per second
54 gigabytes | 54 | gigabytes
7 queue messages | 7 | queue messages

Metrics can be captured either in their raw or aggregated form. An aggregated metric is a way of simplifying the metric across a given period of time. For example, consider a system that processes messages from a queue. We could count the number of messages processed by the system in two ways: we could adjust our count every time a message is processed, or we could check the number of messages on the queue every minute, and batch these into five-minute blocks. The latter is one example of an aggregated metric.

Because metrics are numerical in nature, they can be visualised in different ways. For example, a line chart might show the value of a metric over time.

Azure Monitor also supports adding dimensions to metrics. Dimensions are extra pieces of data that help to add context to a metric. For example, the Azure Monitor metric for the number of messages in a Service Bus namespace has the entity (queue or topic) name as a dimension. Queries and visualisations against this metric can then filter down to specific topics, can visualise each topic separately, or can roll up all topics/queues and show the total number of messages across the whole Service Bus namespace.
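To make the idea of dimensions a little more concrete, here is a minimal Python sketch of what filtering, splitting and rolling up by a dimension amounts to. The sample data and the entity names are invented for illustration, and this is not how Azure Monitor stores or queries metrics internally.

```python
from collections import defaultdict

# Hypothetical samples for one Service Bus namespace:
# (entity name dimension, number of messages).
samples = [
    ("orders-queue", 12),
    ("orders-queue", 7),
    ("invoices-topic", 30),
]


def filter_by_dimension(samples, entity_name):
    """Keep only the samples for a single queue or topic."""
    return [s for s in samples if s[0] == entity_name]


def split_by_dimension(samples):
    """Total the metric separately for each dimension value."""
    totals = defaultdict(int)
    for entity_name, message_count in samples:
        totals[entity_name] += message_count
    return dict(totals)


def roll_up(samples):
    """Ignore the dimension and total the metric across the namespace."""
    return sum(message_count for _, message_count in samples)
```

Splitting gives one series per queue or topic, while rolling up collapses them into a single number for the whole namespace.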

Azure Monitor Metrics

Azure Monitor currently has two metric systems.

  • Classic metrics were traditionally published by most Azure services. When we use the Azure Portal, the Metrics (Classic) page displays these metrics.

  • Near real-time metrics are the newer type of metrics, and Azure is moving to use this across all services. As their name suggests, these metrics get updated more frequently than classic metrics - where classic metrics might not appear for several minutes, near real-time metrics typically are available for querying within 1-2 minutes of being published, and sometimes much quicker than that. Additionally, near real-time metrics are the only metric type that supports dimensions; classic metrics do not. Custom metrics need to be published as near real-time metrics.

Over time, all Azure services will move to the near real-time metrics system. In the meantime, you can check whether a given Azure service is publishing to the classic or newer metric system by checking this page. In this post we'll only be dealing with the newer (near real-time) metric system.

Custom Metrics

Almost all Azure services publish their own metrics in some form, although the usefulness and quality varies depending on the specific service. Core Azure services tend to have excellent support for metrics. Publishing of built-in metrics happens automatically and without any interaction on our part. The metrics are available for interactive querying through the portal and API, and for alerts and all of the other purposes we discussed in part 1 of this series.

There are some situations where the built-in metrics aren't enough for our purposes. This commonly happens within our own applications. For example, if our application has components that process messages from a queue then it can be helpful to know how many messages are being processed per minute, how long each message takes to process, and how many messages are currently on the queues. These metrics can help us to understand the health of our system, to provision new workers to help to process messages more quickly, or to understand bottlenecks that our developers might need to investigate.

There are two ways that we can publish custom metrics into Azure Monitor.

  • Azure Monitor custom metrics, currently a preview service, provides an API for us to send metrics into Azure Monitor. We submit our metrics to an Azure resource, and the metrics are saved alongside the built-in metrics for that resource.

  • Application Insights also provides custom metrics. Our applications can create and publish metrics into Application Insights, and they are accessible by using Application Insights' UI, and through the other places that we work with metrics. Although the core offering of publishing custom metrics into Application Insights is generally available, some specific features are in preview.

How do we choose which approach to use? Broadly speaking, I'd suggest using Azure Monitor's custom metrics API for publishing resource or infrastructure-level metrics - i.e. enriching the data that Azure itself publishes about a resource - and using Application Insights for publishing application-level metrics - i.e. metrics about our own application code.

Here's a concrete example, again related to queue processing. If we have an application that processes queue messages, we'll typically want instrumentation to understand how these queues and processors are behaving. If we're using Service Bus queues or topics then we get a lot of instrumentation about our queues, including the number of messages that are currently on the queue. But if we're using Azure Storage queues, we're out of luck. Azure Storage queues don't have the same metrics, and we don't get the queue lengths from within Azure Monitor. This is an ideal use case for Azure Monitor's custom metrics.

We may also want to understand how long it's taking us to process each message - from the time it was submitted to the queue to the time it completed processing. This is often an important metric to ensure that our data is up-to-date and that users are having the best experience possible. Ultimately this comes down to how long our application is taking to perform its logic, and so this is an application-level concern and not an infrastructure-level concern. We'll use Application Insights for this custom metric.

Let's look at how we can write code to publish each of these metrics.

Publishing Custom Resource Metrics

In order to publish a custom resource metric we need to do the following:

  • Decide whether we will add dimensions.

  • Decide whether we will aggregate our metric's value.

  • Authenticate and obtain an access token.

  • Send the metric data to the Azure Monitor metrics API.

Let's look at each of these in turn, in the context of an example Azure Function app that we'll use to send our custom metrics.

Adding Dimensions

As described above, dimensions let us add extra data to our metrics so that we can group and compare them. We can submit metrics to Azure Monitor with or without dimensions. If we want to include dimensions, we need to include two extra properties - dimNames specifies the names of the dimensions we want to add to the metric, and dimValues specifies the values of those dimensions. The order of the dimension names and values must match so that Azure Monitor can relate the value to its dimension name.

Aggregating Metrics

Metrics are typically queried in an aggregated form - for example, counting or averaging the values of metrics to get a picture of how things are going overall. When submitting custom metrics we can also choose to send our metric values in an aggregated form if we want. The main reasons we'd do this are:

  • To save cost. Azure Monitor custom metrics aren't cheap when you use them at scale, and so pre-aggregating within our application means we don't need to incur quite as high a cost since we aren't sending as much raw data to Azure Monitor to ingest.

  • To reduce a very high volume of metrics. If we have a large number of metrics to report on, it will likely be much faster for us to send the aggregated metric to Azure Monitor rather than sending each individual metric.

However, it's up to us - we can choose to send individual values if we want.

If we send aggregated metrics then we need to construct a JSON object to represent the metric as follows:
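As a rough sketch of that shape, here is a small Python helper that builds such an object (the function in this post is written in C#, so treat this purely as an illustration). The property names time, data, baseData, metric, namespace, series, min, max, sum and count reflect my understanding of the Azure Monitor custom metrics format and should be checked against the current documentation. The metric name QueueLength and the QueueName dimension are made up for the example.

```python
def build_metric_payload(time_utc, metric, namespace, dim_names, dim_values,
                         minimum, maximum, total, count):
    """Build one pre-aggregated custom metric object as a plain dict."""
    # Dimension names and values are matched by position, so the two
    # lists must be the same length and in the same order.
    if len(dim_names) != len(dim_values):
        raise ValueError("dimNames and dimValues must have the same length")
    return {
        "time": time_utc,
        "data": {
            "baseData": {
                "metric": metric,
                "namespace": namespace,
                "dimNames": list(dim_names),
                "series": [
                    {
                        "dimValues": list(dim_values),
                        "min": minimum,
                        "max": maximum,
                        "sum": total,
                        "count": count,
                    }
                ],
            }
        },
    }


# Hypothetical metric and dimension names, used only for illustration.
payload = build_metric_payload(
    "2019-01-01T11:05:00Z", "QueueLength", "queueprocessing",
    ["QueueName"], ["orders"],
    minimum=3, maximum=20, total=28, count=3,
)
```

Passing the result to json.dumps gives the JSON body to submit.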

For example, let's imagine we have recorded the following queue lengths (all times in UTC):

Time | Length
11:00am | 1087
11:01am | 1124
11:02am | 826
11:03am | 888
11:04am | 1201
11:05am | 1091

We might send the following pre-aggregated metrics in a single payload:

Azure Monitor would then be able to display the aggregated metrics for us when we query.
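For reference, the aggregation of the six readings above can be sketched in a few lines of Python. This assumes all six readings are rolled up into a single set of min, max, sum and count values; how the readings are batched in practice is a design choice.

```python
# Queue lengths recorded each minute (times in UTC), taken from the table above.
readings = {
    "11:00": 1087,
    "11:01": 1124,
    "11:02": 826,
    "11:03": 888,
    "11:04": 1201,
    "11:05": 1091,
}


def aggregate(values):
    """Reduce raw values to the four aggregates sent with a custom metric."""
    values = list(values)
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "count": len(values),
    }


summary = aggregate(readings.values())
# summary == {"min": 826, "max": 1201, "sum": 6217, "count": 6}
```

From the sum and count, an average (here about 1036) can be derived when the metric is queried.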

If we chose not to send the metrics in an aggregated form, we'd send the metrics across individual messages; here's an example of the fourth message:

Security for Communicating with Azure Monitor

We need to obtain an access token when we want to communicate with Azure Monitor. When we use Azure Functions, we can make use of managed identities to simplify this process a lot. I won't cover all the details of managed identities here, but the example ARM template for this post includes the creation and use of a managed identity. Once the function is created, it can use the following code to obtain a token that is valid for communication with Azure Monitor:

The second part of this process is authorising the function's identity to write metrics to resources. This is done by using the standard Azure role-based access control system. The function's identity needs to be granted the Monitoring Metrics Publisher role, which has been defined with the well-known role definition ID 3913510d-42f4-4e42-8a64-420c390055eb.

Sending Custom Metrics to Azure Monitor

Now we have our metric object and our access token ready, we can submit the metric object to Azure Monitor. The actual submission is fairly easy - we just perform a POST to a URL. However, the URL we submit to will be different depending on the resource's location and resource ID, so we dynamically construct the URL as follows:

We might deploy into the West US 2 region, so an example URL might look like this:

Currently Azure Monitor only supports a subset of Azure regions for custom metrics, but this list is likely to grow as the feature moves out of preview.
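To make the URL construction concrete, here is a minimal Python sketch that builds the regional URL and prepares (but does not send) the POST request. The URL pattern of a region-specific monitoring.azure.com hostname followed by the resource ID and /metrics is my understanding of the custom metrics endpoint, and the subscription, resource group and storage account names are placeholders.

```python
import json
import urllib.request


def build_metrics_url(location, resource_id):
    """Combine the resource's region and resource ID into the metrics URL."""
    return f"https://{location}.monitoring.azure.com/{resource_id.strip('/')}/metrics"


def build_metrics_request(location, resource_id, access_token, payload):
    """Prepare the POST request; call urllib.request.urlopen(...) to send it."""
    return urllib.request.Request(
        build_metrics_url(location, resource_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder resource ID, for illustration only.
example_resource_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg"
    "/providers/Microsoft.Storage/storageAccounts/qexample"
)
example_url = build_metrics_url("westus2", example_resource_id)
```

The access token here is the one obtained for the function's managed identity, as described in the previous section.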

Here is the full C# Azure Function we use to send our custom metrics:

Testing our Function

I've provided an ARM template that you can deploy to test this:

Make sure to deploy this into a region that supports custom metrics, like West US 2.

Once you've deployed it, you can create some queues in the storage account (use the storage account that begins with q and not the one that begins with fn). Add some messages to the queues, and then run the function or wait for it to run automatically every five minutes.

Then you can check the metrics for the storage queue, making sure to change the metric namespace to queueprocessing:


You should see something like the following:


As of the time of writing (January 2019), there is a bug where Azure Storage custom metrics don't display dimensions. This will hopefully be fixed soon.

Publishing Custom Application Metrics

Application Insights also allows for the publishing of custom metrics using its own SDK and APIs. These metrics can be queried through Azure Monitor in the same way as resource-level metrics. The process by which metrics are published into Application Insights is quite different to how Azure Monitor custom metrics are published, though.

The Microsoft documentation on Application Insights custom metrics is quite comprehensive, so rather than restate it here I will simply link to the important parts. I will focus on the C# SDK in this post.

To publish a custom metric to Application Insights you need an instance of the TelemetryClient class. In an Azure Functions app you can set the APPINSIGHTS_INSTRUMENTATIONKEY application setting - for example, within an ARM template - and then create an instance of TelemetryClient. The TelemetryClient will find the setting and will automatically configure itself to send telemetry to the correct place.

Once you have an instance of TelemetryClient you can use the GetMetric().TrackValue() method to log a new metric value, which is then pre-aggregated and sent to Application Insights after a short delay. There are a number of overloads of this method that can be used to submit custom dimensions, too.

Note that as some features are in preview, they don't work consistently yet - for example, at time of writing custom namespaces aren't honoured correctly, but this should hopefully be resolved soon.

If you want to send raw metrics rather than pre-aggregated metrics, the legacy TrackMetric() method can be used, but Microsoft discourage its use and are deprecating it.

Here is some example Azure Function code that writes a random value to the My Test Metric metric:

And a full ARM template that deploys this is:


Custom metrics allow us to enrich our telemetry data with numerical values that can be aggregated and analysed, both manually through portal dashboards and APIs, and automatically using a variety of Azure Monitor features. We can publish custom metrics against any Azure resource by using the new custom metrics APIs, and we can also write application-level metrics to Application Insights.

In the next part of this series we will start to look at alerts, and will specifically look at metric alerts - one way to have Azure Monitor process the data for both built-in and custom metrics and alert us when things go awry.

Low-Cost Rate Limiting for Azure Functions APIs with API Management's Consumption Tier

This post was originally published on the Kloud blog.

Azure Functions can be used as a lightweight platform for building APIs. They support a number of helpful features for API developers including custom routes and a variety of output bindings that can implement complex business rules. They also have a consumption-based pricing model, which keeps costs low while you have low levels of traffic, but can scale or burst for higher levels of demand.

The Azure Functions platform also provides Azure Functions Proxies, which gives another set of features to further extend APIs built on top of Azure Functions. These features include more complex routing rules and the ability to do a small amount of request rewriting. These features have led some people to compare Azure Functions Proxies to a very lightweight API management system. However, there are a number of features of an API management platform that Azure Functions Proxies doesn't support. One common feature of an API management layer is the ability to perform rate limiting on incoming requests.

Azure API Management is a hosted API management service that provides a large number of features. Until recently, API Management's pricing model was often prohibitive for small APIs, since using it for production workloads required provisioning a service instance with a minimum monthly cost of about AUD$200. But Microsoft recently announced a new consumption tier for API Management. Based on a similar pricing model to Azure Functions, the consumption tier for API Management bills per request, which makes it a far more appealing choice for serverless APIs. APIs can now use features like rate limiting - and many others - without needing to invest in a large monthly expense.

In this post I'll describe how Azure Functions and the new API Management pricing tier can be used together to build a simple serverless API with rate limiting built in, and at a very low cost per transaction.

Note: this new tier is in preview, and so isn't yet ready for production workloads - but it will hopefully be generally available and supported soon. In the meantime, it's only available for previewing in a subset of Azure regions. For my testing I've been using Australia East.

Example Scenario

In this example, we'll build a simple serverless API that would benefit from rate limiting. In our example function we simulate performing some business logic to calculate shipping rates for orders. Our hypothetical algorithm is very sophisticated, and so we may later want to monetise our API to make it available for high-volume users. In the meantime we want to allow our customers to try it out a little bit for free, but we want to put limits around their use.

There may be other situations where we need rate limiting too - for example, if we have a back-end system we call into that can only cope with a certain volume of requests, or that bills us when we use it.

First, let's write a very simple function to simulate some custom business logic.

Function Code

For simplicity I'm going to write a C# script version of an Azure Function. You could easily change this to a precompiled function, or use any of the other languages that Azure Functions supports.

Our simulated function logic is as follows:

  • Receive an HTTP request with a body containing some shipping details.

  • Calculate the shipping cost.

  • Return the shipping cost.

In our simulation we'll just make up a random value, but of course we may have much more sophisticated logic in future. We could also call into other back-end functions or APIs too.

Here's our function code:

If we paste this code into the Azure Functions portal, we'll be able to try it out, and sure enough we can get a result:

API Management Policy

Now that we've got our core API function working, the next step is to put an API Management gateway in front of it so we can apply our rate limiting logic. API Management works in terms of policies that are applied to incoming requests. When we work with the consumption tier of API Management we can make use of the policy engine, although there are some limitations. Even with these limitations, policies are very powerful and let us express and enforce a lot of complex rules. A full discussion of API Management's policy system is beyond the scope of this post, but I recommend reviewing the policy documentation.

Here is a policy that we can use to perform our rate limiting:

This policy uses the caller's IP address as the rate limit key. This means that if the same IP address makes three API calls within a 15-second period, it will get rate limited and told to try again later. Of course, we can adjust the lockout time, the number of calls allowed, and even the way that we group requests together when determining the rate limit.
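To illustrate what rate limiting by key means, here is a minimal Python sketch of a fixed-window limiter keyed by IP address. It is not how API Management implements its policy internally (this post doesn't cover that). It assumes three calls are allowed per 15-second window and that any further call in the same window is rejected.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `calls` requests per key in each `period_seconds` window."""

    def __init__(self, calls, period_seconds, clock=time.monotonic):
        self.calls = calls
        self.period = period_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._windows = {}  # key -> (window start time, calls seen so far)

    def allow(self, key):
        now = self.clock()
        start, seen = self._windows.get(key, (now, 0))
        if now - start >= self.period:
            # The previous window has expired, so start a fresh one.
            start, seen = now, 0
        if seen >= self.calls:
            self._windows[key] = (start, seen)
            return False  # the caller should be told to try again later
        self._windows[key] = (start, seen + 1)
        return True


# Three calls per 15 seconds, keyed by the caller's IP address.
limiter = FixedWindowRateLimiter(calls=3, period_seconds=15)
```

Changing the key passed to allow() changes how requests are grouped, for example by subscription key or user ID rather than by IP address.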

Because we may have additional APIs in the future that would be subject to this rate limit, we'll create an API Management product and apply the policy to that. This means that any APIs we add to that product will have this policy applied.

Securing the Connection

Of course, there's not much point in putting an API Management layer in front of our function API if someone can simply go around it and call the function directly. There are a variety of ways of securing the connection between an API Management instance and a back-end Azure Functions app, including using function keys, function host keys, and Azure AD tokens. In other tiers of API Management you can also use the IP address of the API Management gateway, but in the consumption tier we don't get any IP addresses to perform whitelisting on.

For this example we'll use the function key for simplicity. (For a real production application I'd recommend using a different security model, though.) This means that we will effectively perform a key exchange:

  • Requests will arrive into the API Management service without any keys.

  • The API Management service will perform its rate limiting logic.

  • If this succeeds, the API Management service will call into the function and pass in the function key, which only it knows.

In this way, we're treating the API Management service as a trusted subsystem - we're configuring it with the credentials (i.e. the function key) necessary to call the back-end API. Azure API Management provides a configuration system to load secrets like this, but for simplicity we'll just inject the key straight into a policy. Here's the policy we've used:

We'll inject the function key into the policy at the time we deploy the policy.
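The key exchange itself is simple enough to sketch in Python. This assumes the key is passed in the x-functions-key request header, which Azure Functions accepts. The key could equally be passed as the code query string parameter, so treat the choice of header as an assumption.

```python
FUNCTION_KEY_HEADER = "x-functions-key"


def add_function_key(inbound_headers, function_key):
    """Return the headers to send to the back-end function.

    Any key supplied by the original caller is discarded, so only the
    gateway's own key ever reaches the function.
    """
    outbound = {
        name: value
        for name, value in inbound_headers.items()
        if name.lower() != FUNCTION_KEY_HEADER
    }
    outbound[FUNCTION_KEY_HEADER] = function_key
    return outbound
```

The caller never sees the key; it lives only in the gateway's configuration.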

As this logic is specific to our API, we'll apply this policy to the API and not to our product.

Deploying Through an ARM Template

We'll use an ARM template to deploy this whole example. The template performs the following actions, approximately in this order:

  • Deploys the Azure Functions app.

  • Adds the shipping calculator function into the app using the deployment technique I discussed in a previous post.

  • Deploys an API Management instance using the consumption tier.

  • Creates an API in our API Management instance.

  • Configures an API operation to call into the shipping calculator function.

  • Adds a policy to the API operation to add the Azure Functions host key to the outbound request to the function.

  • Creates an API Management product for our shipping calculator.

  • Adds a rate limit policy to the product.

Here's the ARM template:

There's a lot going on here, and I recommend reading the API Management documentation for further detail on each of these. One important note is that whenever you interact with an API Management instance on the consumption tier using ARM templates, you must use API version 2018-06-01-preview or newer.

Calling our API

Now that we've deployed our API we can call it through our API Management gateway's public hostname. In my case I used Postman to make some API calls. The first few calls succeeded:

But then after I hit the rate limit, as expected I got an error response back:


Trying again 13 seconds later, the request succeeded. So we can see our API Management instance is configured correctly and is performing rate limiting as we expected.


With the new consumption tier of Azure API Management, it's now possible to have a low-cost set of API management features to deploy alongside your Azure Functions APIs. Of course, your APIs could be built on any technology, but if you are already in the serverless ecosystem then this is a great way to protect your APIs and back-ends. Plus, if your API grows to a point where you need more of the features of Azure API Management that aren't provided in the consumption tier, or if you want to switch to a fixed-cost pricing model, you can always upgrade your API Management instance to one of the higher tiers. You can do this by simply modifying the sku property on the API Management resource within the ARM template.

Integration Testing Timer-Triggered Precompiled v2 Azure Functions

This post was originally published on the Kloud blog.

In a recent post, I described a way to run integration tests against precompiled C# Azure Functions using the v2 runtime. In that post, we looked at an example of invoking an HTTP-triggered function from within an integration test.

Of course, there are plenty of other triggers available for Azure Functions too. Recently I needed to write an integration test against a timer-triggered function and decided to investigate the best way to do this.

The Azure Functions runtime provides a convenient API for invoking a timer-triggered function. You can issue an HTTP POST request against an endpoint for your function, and the function runtime will start up and trigger the function. I wasn't able to find any proper documentation about this, so this blog post is a result of some of my experimentation.

To invoke a timer-triggered function named MyFunction, you need to issue an HTTP request as follows:

Replace the hostname placeholder with either your real Azure Functions hostname or, if you're running through the Azure Functions CLI, localhost and the port number you're running it on, such as localhost:7071.

Interestingly, invoking this endpoint immediately returns an HTTP 201 response, but the function runs separately. This makes sense, though, since timer-triggered functions are not intended to return data in the same way that HTTP-triggered functions are.
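As a minimal Python sketch, a request like this can be prepared as follows. It assumes the runtime's /admin/functions/ endpoint with an empty JSON object as the body, and it assumes that a key only needs to be supplied (via the x-functions-key header) when calling an app hosted in Azure rather than a local instance. The helper name is made up for the example.

```python
import json
import urllib.request


def build_timer_invocation_request(host, function_name, master_key=None,
                                   scheme="http"):
    """Prepare (but don't send) the POST that triggers a timer function."""
    headers = {"Content-Type": "application/json"}
    if master_key is not None:
        # Assumption: a hosted function app requires a key for admin endpoints.
        headers["x-functions-key"] = master_key
    return urllib.request.Request(
        f"{scheme}://{host}/admin/functions/{function_name}",
        data=json.dumps({}).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_timer_invocation_request("localhost:7071", "MyFunction")
# Send it with urllib.request.urlopen(request) once the host is running.
```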

I've created an updated version of the GitHub repository from the previous post with an example of running a test against a timer-triggered function. In this example, the function simply writes a message to an Azure Storage queue, which we can then look at to confirm the function has run. Normally the function would only run once a week, at 9.30am on Mondays, but our integration test triggers it each time it runs to verify that it works correctly.

In this version of the test fixture we also wait for the app to start before our test runs. We do this by polling the / endpoint with GET requests until it responds. This ensures that we can access the timer invocation HTTP endpoint successfully from our tests.
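The polling approach can be sketched in Python along these lines (the test fixture in the repository is C#, so this is only an illustration of the idea). The attempt count and delay are arbitrary, and treating any HTTP response, even an error status, as a sign that the host is up is an assumption.

```python
import time
import urllib.error
import urllib.request


def host_is_responding(url, timeout=2.0):
    """Return True if a GET to the URL gets any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the host answered, just not with a success status
    except (urllib.error.URLError, OSError):
        return False  # nothing is listening yet


def wait_until_ready(probe, attempts=30, delay_seconds=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A fixture would call something like wait_until_ready(lambda: host_is_responding("http://localhost:7071/")) before running any tests, and fail fast if it returns False.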

Of course, just like in the previous post's integration tests, a timer-triggered integration test can run from an Azure Pipelines build too, so you can include timer-triggered functions in your continuous integration and testing practices alongside the rest of your code. In fact, the same build.yaml that we used in the previous post can be used to run these tests, too.

Automating Azure Instrumentation and Monitoring - Part 2: Application Insights

This post was originally published on the Kloud blog.

Application Insights is a component of Azure Monitor for application-level instrumentation. It collects telemetry from your application infrastructure like web servers, App Services, and Azure Functions apps, and from your application code. In this post we'll discuss how Application Insights can be automated in several key ways: first, by setting up an Application Insights instance in an ARM template; second, by connecting it to various types of Azure application components through automation scripts including Azure Functions, App Services, and API Management; and third, by configuring its smart detection features to emit automatic alerts in a configurable way. As this is the first time in this series that we'll deploy instrumentation code, we'll also discuss an approach that can be used to manage the deployment of different types and levels of monitoring into different environments.

This post is part of a series:

  • Part 1 provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an 'infrastructure as code' mindset for our instrumentation and monitoring components.

  • Part 2 (this post) describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.

  • Part 3 discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.

  • Part 4 covers the basics of alerts and metric alerts. Azure Monitor's powerful alerting system is a big topic, and in this part we'll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.

  • Part 5 covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts notify us when Azure itself is having an issue that may result in downtime or degraded performance.

  • Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I'll show a few tips and tricks I've learned when doing this.

  • Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We'll discuss deploying and automating both single-step (ping) and multi-step availability tests.

  • Part 8 (coming soon) describes autoscale. While this isn't exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.

  • Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.

Setting up Application Insights

When using the Azure Portal or Visual Studio to work with various types of resources, Application Insights will often be deployed automatically. This is useful when we're exploring services or testing things out, but when it comes time to building a production-grade application, it's better to have some control over the way that each of our components is deployed. Application Insights can be deployed using ARM templates, which is what we'll do in this post.

Application Insights is a simple resource to create from an ARM template. An instance with the core functionality can be created with a small ARM template:
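A minimal sketch of such a template might look like the following. The parameter names and the apiVersion are assumptions chosen for illustration, so check them against the current resource reference before relying on them:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "applicationInsightsName": {
      "type": "string",
      "metadata": { "description": "Hypothetical parameter: the name of the Application Insights instance." }
    },
    "applicationInsightsLocation": {
      "type": "string",
      "metadata": { "description": "Hypothetical parameter: a region in which Application Insights is available." }
    }
  },
  "resources": [
    {
      "comments": "Assumed apiVersion; a newer version may be available.",
      "type": "Microsoft.Insights/components",
      "apiVersion": "2015-05-01",
      "name": "[parameters('applicationInsightsName')]",
      "location": "[parameters('applicationInsightsLocation')]",
      "kind": "web",
      "properties": {
        "Application_Type": "web"
      }
    }
  ],
  "outputs": {
    "instrumentationKey": {
      "type": "string",
      "value": "[reference(resourceId('Microsoft.Insights/components', parameters('applicationInsightsName')), '2015-05-01').InstrumentationKey]"
    }
  }
}
```

The output is optional. It is included only to show how the instrumentation key can be read from within a template.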

There are a few important things to note about this template:

  • Application Insights is only available in a subset of Azure regions. This means you may need to deploy it in a region other than the region your application infrastructure is located in. The template above includes a parameter to specify this explicitly.

  • The name of your Application Insights instance doesn't have to be globally unique. Unlike resources like App Services and Cosmos DB accounts, there are no DNS names attached to an Application Insights instance, so you can use the same name across multiple instances if they're in different resource groups.

  • Application Insights isn't free. If you have a lot of data to ingest, you may incur large costs. You can use quotas to manage this if you're worried about it.

  • There are more options available that we won't cover here. This documentation page provides further detail.

After an Application Insights instance is deployed, it has an instrumentation key that can be used to send data to the correct Application Insights instance from your application. The instrumentation key can be accessed both through the portal and programmatically, including within ARM templates. We'll use this when publishing telemetry.

Publishing Telemetry

There are a number of ways to publish telemetry into Application Insights. While I won't cover them all, I'll give a quick overview of some of the most common ways to get data from your application into Application Insights, and how to automate each.

Azure Functions

Azure Functions has built-in integration with Application Insights. If you create an Azure Functions app through the portal it asks whether you want to set up this integration. Of course, since we're using automation scripts, we have to do a little work ourselves. The magic lies in an app setting, APPINSIGHTS_INSTRUMENTATIONKEY, which we can attach to any function app. Here is an ARM template that deploys an Azure Functions app (and associated app service plan and storage account), an Application Insights instance, and the configuration to link the two:
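The essential piece is the app setting on the function app resource. Here's a sketch of just that resource, assuming the app service plan, storage account, and Application Insights instance are declared elsewhere in the same template. The parameter names and apiVersions are hypothetical choices for illustration:

```json
{
  "comments": "Sketch only: other required function app settings (such as AzureWebJobsStorage) are omitted for brevity.",
  "type": "Microsoft.Web/sites",
  "apiVersion": "2016-08-01",
  "name": "[parameters('functionsAppName')]",
  "location": "[resourceGroup().location]",
  "kind": "functionapp",
  "dependsOn": [
    "[resourceId('Microsoft.Web/serverfarms', parameters('appServicePlanName'))]",
    "[resourceId('Microsoft.Insights/components', parameters('applicationInsightsName'))]"
  ],
  "properties": {
    "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', parameters('appServicePlanName'))]",
    "siteConfig": {
      "appSettings": [
        {
          "name": "APPINSIGHTS_INSTRUMENTATIONKEY",
          "value": "[reference(resourceId('Microsoft.Insights/components', parameters('applicationInsightsName')), '2015-05-01').InstrumentationKey]"
        }
      ]
    }
  }
}
```

The explicit dependsOn entry for the Application Insights instance matters here, because the reference() call needs the instance to exist before its instrumentation key can be read.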

App Services

If we're deploying a web app into an app service using ASP.NET, we can use the ASP.NET integration directly. In fact, this works across many different hosting environments, and is described in the next section.

If you've got an app service that is not using ASP.NET, though, you can still get telemetry from your web app into Application Insights by using an app service extension. Extensions augment the built-in behaviour of an app service by installing some extra pieces of logic into the web server. Application Insights has one such extension, and we can configure it using an ARM template:
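As a rough sketch, the extension can be declared as a child resource of the app service. This fragment assumes the app service plan and Application Insights instance are declared elsewhere in the template. The parameter names and apiVersions are hypothetical, and the extension name is the identifier I understand the Application Insights site extension to use, so verify it before deploying:

```json
{
  "type": "Microsoft.Web/sites",
  "apiVersion": "2016-08-01",
  "name": "[parameters('appServiceName')]",
  "location": "[resourceGroup().location]",
  "dependsOn": [
    "[resourceId('Microsoft.Web/serverfarms', parameters('appServicePlanName'))]",
    "[resourceId('Microsoft.Insights/components', parameters('applicationInsightsName'))]"
  ],
  "properties": {
    "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', parameters('appServicePlanName'))]",
    "siteConfig": {
      "appSettings": [
        {
          "name": "APPINSIGHTS_INSTRUMENTATIONKEY",
          "value": "[reference(resourceId('Microsoft.Insights/components', parameters('applicationInsightsName')), '2015-05-01').InstrumentationKey]"
        }
      ]
    }
  },
  "resources": [
    {
      "comments": "Assumed extension identifier; installs the Application Insights site extension into the app service.",
      "type": "siteextensions",
      "apiVersion": "2016-08-01",
      "name": "Microsoft.ApplicationInsights.AzureWebSites",
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', parameters('appServiceName'))]"
      ],
      "properties": {}
    }
  ]
}
```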

As you can see, similarly to the Azure Functions example, the APPINSIGHTS_INSTRUMENTATIONKEY is used here to link the app service with the Application Insights instance.

One word of warning - I've found that the site extension ARM resource isn't always deployed correctly the first time the template is deployed. If you get an error the first time you deploy, try it again and see if the problem goes away. I've tried, but have never fully been able to understand why this happens or how to stop it.

ASP.NET Applications

If you have an ASP.NET application running in an App Service or elsewhere, you can install a NuGet package into your project to collect Application Insights telemetry. This process is documented here. If you do this, you don't need to install the App Services extension from the previous section. Make sure to set the instrumentation key in your configuration settings and then flow it through to Application Insights from your application code.

API Management

If you have an Azure API Management instance, you might be aware that this can publish telemetry into Application Insights too. This allows for monitoring of requests all the way through the request handling pipeline. When it comes to automation, Azure API Management has very good support for ARM templates, and its Application Insights integration is no exception.

At a high level there are two things we need to do: first, we create a logger resource to establish the API Management-wide connection with Application Insights; and second, we create a diagnostic resource to instruct our APIs to send telemetry to the Application Insights instance we have configured. We can create a diagnostic resource for a specific API or to cover all APIs.

The diagnostic resource includes a sampling rate, which is the percentage of requests that should have their telemetry sent to Application Insights. There is a lot of detail to be aware of with this feature, such as the performance impact and the ways in which sampling can reduce that impact. We won't get into that here, but I encourage you to read more detail from Microsoft's documentation before using this feature.

Here's an example ARM template that deploys an API Management instance, an Application Insights instance, and configuration to send telemetry from every request into Application Insights:
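The two resources described above can be sketched as follows. This is a fragment of a template's resources section rather than a complete template: it assumes the API Management and Application Insights instances are declared elsewhere, and the parameter names and apiVersions are assumptions for illustration:

```json
{
  "resources": [
    {
      "comments": "Logger: the API Management-wide connection to Application Insights.",
      "type": "Microsoft.ApiManagement/service/loggers",
      "apiVersion": "2019-01-01",
      "name": "[concat(parameters('apiManagementServiceName'), '/', parameters('applicationInsightsName'))]",
      "dependsOn": [
        "[resourceId('Microsoft.ApiManagement/service', parameters('apiManagementServiceName'))]",
        "[resourceId('Microsoft.Insights/components', parameters('applicationInsightsName'))]"
      ],
      "properties": {
        "loggerType": "applicationInsights",
        "credentials": {
          "instrumentationKey": "[reference(resourceId('Microsoft.Insights/components', parameters('applicationInsightsName')), '2015-05-01').InstrumentationKey]"
        }
      }
    },
    {
      "comments": "Diagnostic: applies to all APIs and sends telemetry for every request (100% sampling).",
      "type": "Microsoft.ApiManagement/service/diagnostics",
      "apiVersion": "2019-01-01",
      "name": "[concat(parameters('apiManagementServiceName'), '/applicationinsights')]",
      "dependsOn": [
        "[resourceId('Microsoft.ApiManagement/service/loggers', parameters('apiManagementServiceName'), parameters('applicationInsightsName'))]"
      ],
      "properties": {
        "loggerId": "[resourceId('Microsoft.ApiManagement/service/loggers', parameters('apiManagementServiceName'), parameters('applicationInsightsName'))]",
        "alwaysLog": "allErrors",
        "sampling": {
          "samplingType": "fixed",
          "percentage": 100
        }
      }
    }
  ]
}
```

Lowering the percentage value reduces the proportion of requests that have their telemetry sent to Application Insights.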

Smart Detection

Application Insights provides a useful feature called smart detection. Application Insights watches your telemetry as it comes in, and if it notices unusual changes, it can send an alert to notify you. For example, it can detect the following types of issues:

  • An application suddenly sends back a higher rate of 5xx (error)-class status responses than it was sending previously.

  • The time it takes for an application to communicate with a database has increased significantly above the previous average.

Of course, this feature is not foolproof - for example, in my experience it won't detect slow changes in error rates over time that may still indicate an issue. Nevertheless, it is a very useful feature to have available to us, and it has helped me identify problems on numerous occasions.

Smart detection is enabled by default. Unless you configure it otherwise, smart detection alerts are sent to all owners of the Azure subscription in which the Application Insights instance is located. In many situations this is not desirable: when your Azure subscription contains many different applications, each with different owners; or when the operations or development team are not granted the subscription owner role (as they should not be!); or when the subscriptions are managed by a central subscription management team who cannot possibly deal with the alerts they receive from all applications. We can configure each smart detection alert using the proactiveDetectionConfigs ARM resource type.

Here is an example ARM template showing how the smart detection alerts can be redirected to an email address you specify:
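A sketch of a single smart detection rule configured this way might look like the following. It assumes the Application Insights instance is declared elsewhere in the template. The parameter names are hypothetical, the apiVersion is an assumption, and the rule name shown is just one of several rules, each of which would need its own resource:

```json
{
  "comments": "Redirects one smart detection rule to a custom email address instead of the subscription owners.",
  "type": "Microsoft.Insights/components/ProactiveDetectionConfigs",
  "apiVersion": "2018-05-01-preview",
  "name": "[concat(parameters('applicationInsightsName'), '/slowserverresponsetime')]",
  "location": "[parameters('applicationInsightsLocation')]",
  "dependsOn": [
    "[resourceId('Microsoft.Insights/components', parameters('applicationInsightsName'))]"
  ],
  "properties": {
    "name": "slowserverresponsetime",
    "sendEmailsToSubscriptionOwners": false,
    "customEmails": [
      "[parameters('alertEmailAddress')]"
    ],
    "enabled": true
  }
}
```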

In development environments, you may not want to have these alerts enabled at all. Development environments can be used sporadically, and can have a much higher error rate than normal, so the signals that Application Insights uses to proactively monitor for problems aren't as useful. I find that it's best to configure smart detection myself so that I can switch it on or off for different environments, and for those environments that do need it, I'll override the alert configuration to send to my own alert email address and not to the subscription owners. This requires us to have different instrumentation configuration for different environments.

Instrumentation Environments

In most real-world applications, we end up deploying the application in at least three environments: development environments, which are used by software developers as they actively work on a feature or change; non-production environments, which are used by testers, QA engineers, product managers, and others who need to access a copy of the application before it goes live; and production environments, which are used by customers and may be monitored by a central operations team. Within these categories, there can be multiple actual environments too - for example, there can be different non-production environments for different types of testing (e.g. functional testing, security testing, and performance testing), and some of these may be long-lived while others are short-lived.

Each of these different environments also has different needs for instrumentation:

  • Production environments typically need the highest level of alerting and monitoring since an issue may affect our customers' experiences. We'll typically have many alerts and dashboards set up for production systems. But we also may not want to collect large volumes of telemetry from production systems, especially if doing so may cause a negative impact on our application's performance.

  • Non-production environments may need some level of alerting, but there are certain types of alerts that may not make sense compared to production environments. For example, we may run our non-production systems on a lower tier of infrastructure compared to our production systems, and so an alert based on the application's response time may need different thresholds to account for the lower expected performance. But in contrast to production environments, we may consider it to be important to collect a lot of telemetry in case our testers do find any issues and we need to diagnose them interactively, so we may allow for higher levels of telemetry sampling than we would in a production environment.

  • Development environments may only need minimal instrumentation. Typically in development environments I'll deploy all of the telemetry collection that I would deploy for non-production environments, but turn all alerts and dashboards off. In the event of any issues, I'll be interactively working with the telemetry myself anyway.

Of course, your specific needs may be different, but in general I think it's good to categorise our instrumentation across types of environments. For example, here is how I might typically deploy Application Insights components across environments:

| Instrumentation Type | Development | NonProduction | Production |
| --- | --- | --- | --- |
| Application Insights smart detection | Off | On, sending alerts to developers | On, sending alerts to production monitoring group |
| Application Insights Azure Functions integration | On | On | On |
| Application Insights App Services integration | On | On | On |
| Application Insights API Management integration | On, at 100% sampling | On, at 100% sampling | On, at 30% sampling |

Once we've determined those basic rules, we can then implement them. In the case of ARM templates, I tend to use ARM template parameters to handle this. As we go through this series we'll see examples of how we can use parameters to achieve this conditional logic. I'll also present versions of this table with my suggestions for the components that you might consider deploying for each environment.
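One way to express these rules, sketched here under the assumption of a hypothetical environmentType parameter and made-up setting names, is to keep a lookup object in the template's variables and index into it with the parameter value:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environmentType": {
      "type": "string",
      "allowedValues": [ "Development", "NonProduction", "Production" ],
      "metadata": { "description": "Hypothetical parameter selecting which set of instrumentation rules to apply." }
    }
  },
  "variables": {
    "environmentSettings": {
      "Development": {
        "smartDetectionEnabled": false,
        "apiManagementSamplingPercentage": 100
      },
      "NonProduction": {
        "smartDetectionEnabled": true,
        "apiManagementSamplingPercentage": 100
      },
      "Production": {
        "smartDetectionEnabled": true,
        "apiManagementSamplingPercentage": 30
      }
    },
    "currentSettings": "[variables('environmentSettings')[parameters('environmentType')]]"
  },
  "resources": [],
  "outputs": {
    "smartDetectionEnabled": {
      "type": "bool",
      "value": "[variables('currentSettings').smartDetectionEnabled]"
    },
    "apiManagementSamplingPercentage": {
      "type": "int",
      "value": "[variables('currentSettings').apiManagementSamplingPercentage]"
    }
  }
}
```

This sketch deploys no resources and only outputs the resolved values. In a real template, expressions like [variables('currentSettings').smartDetectionEnabled] would be used directly in the properties of the relevant resources.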

Configuring Smart Detection through ARM Templates

Now that we have a basic idea of how we'll configure instrumentation in each environment, we can reconsider how we might configure Application Insights. Typically I suggest deploying a single Application Insights instance for each environment the system will be deployed into. If we're building up a complex ARM template with all of the system's components, we can embed the conditional logic required to handle different environments in there.

Here's a large ARM template that includes everything we've created in this post, and has the three environment type modes:


Application Insights is a very useful tool for monitoring our application components. It collects telemetry from a range of different sources, which can all be automated. It provides automatic analysis of some of our data and has smart detection features, which again we can configure through our automation scripts. Furthermore, we can publish data into it ourselves as well. In fact, in the next post of this series, we'll discuss how we can publish custom metrics into Application Insights.

Automating Azure Instrumentation and Monitoring - Part 1: Introduction

This post was originally published on the Kloud blog.

Instrumentation and monitoring is a critical part of managing any application or system. By proactively monitoring the health of the system as a whole, as well as each of its components, we can mitigate potential issues before they affect customers. And if issues do occur, good instrumentation alerts us to that fact so that we can respond quickly.

Azure provides a set of powerful monitoring and instrumentation tools to instrument almost all Azure services as well as our own applications. By taking advantage of these tools we can improve the quality of our systems. However, there isn't a lot of documentation on how to script and automate the instrumentation components that we build. Alerts, dashboards, and other instrumentation components are important parts of our systems and deserve as much attention as our application code or other parts of our infrastructure. In this series, we'll cover many of the common types of instrumentation used in Azure-hosted systems and will outline how many of these can be automated, usually with a combination of ARM templates and scripting. The series consists of nine parts:

  • Part 1 (this post) provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an 'infrastructure as code' mindset for our instrumentation and monitoring components.

  • Part 2 describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.

  • Part 3 discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.

  • Part 4 covers the basics of alerts and metric alerts. Azure Monitor's powerful alerting system is a big topic, and in this part we'll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.

  • Part 5 covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts notify us when Azure itself is having an issue that may result in downtime or degraded performance.

  • Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I'll show a few tips and tricks I've learned when doing this.

  • Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We'll discuss deploying and automating both single-step (ping) and multi-step availability tests.

  • Part 8 (coming soon) describes autoscale. While this isn't exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.

  • Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.

While the posts will cover the basics of each of these topics, the focus will be on deploying and automating each of these components. I'll provide links to more details on the inner workings where needed to supplement the basic overview I'll provide. Also, I'll assume some basic familiarity with ARM templates and PowerShell.

Let's start by reviewing the landscape of instrumentation on Azure.

Azure's Instrumentation Platform

As Azure has evolved, it's built up an increasingly comprehensive suite of tools for monitoring the individual components of a system as well as complete systems as a whole. The key piece of the Azure monitoring puzzle is named, appropriately enough, Azure Monitor. Azure Monitor is a built-in service that works with almost all Azure services. Many of its features are free. It automatically captures telemetry, consolidates it, and makes the data available for interactive querying as well as for a variety of other purposes that we'll discuss throughout the series.

This isn't quite the whole story, though. While Azure Monitor works well most of the time, and it appears to be the strategic direction that Azure is heading in, there are a number of exceptions, caveats, and complexities - and these become more evident when you try to automate it. I'll cover some of these in more detail below.


Metrics are numeric values that represent a distinct piece of information about a component at a point in time. The exact list of metrics depends on what makes sense for a given service. For example, a virtual machine publishes metrics for the CPU and memory used; a SQL database has metrics for the number of connections and the database throughput units used; a Cosmos DB account publishes metrics for the number of requests issued to the database engine; and an App Service has metrics for the number of requests flowing through. There can be dozens of different metrics published for any given Azure service, and they are all documented for reference. We'll discuss metrics in more detail throughout the series, as there are some important things to be aware of when dealing with metrics.

As well as Azure Monitor's metrics support, some Azure services have their own metrics systems. For example, SQL Azure has a large amount of telemetry that can be accessed through dynamic management views. Some of the key metrics are also published into Azure Monitor, but if you want to use metrics that are only available in dynamic management views then you won't be able to use the analysis and processing features of Azure Monitor. We'll discuss a potential workaround for this in part 3 of this series.

A similar example is Azure Storage queues. Azure Storage has an API that can be used to retrieve the approximate number of messages sitting in a queue, but this metric isn't published into Azure Monitor and so isn't available for alerting or dashboarding. Again, we'll discuss a potential workaround for this in part 3 of this series.

Nevertheless, in my experience, almost all of the metrics I work with on a regular basis are published through Azure Monitor, and so in this series we'll predominantly focus on these.


Logs are structured pieces of data, usually with a category, a level, and a textual message, and often with a lot of additional contextual data as well. Broadly speaking, there are several general types of logs that Azure deals with:

  • Resource activity logs are essentially the logs for management operations performed on Azure resources through the Azure Resource Management (ARM) API, and a few other types of management-related logs. They can be interactively queried using the Azure Portal blades for any resource, as well as resource groups and subscriptions. You can typically view these by looking at the Activity log tab from any Azure resource blade in the portal. Activity logs contain all write operations that pass through the ARM API. If you use the ARM API directly, or indirectly through the Azure Portal, CLI, PowerShell, or anything else, you'll see logs appear in here. More detail on activity logs is available here.

  • Azure AD activity logs track Active Directory sign-ins and management actions. These can be viewed from within the Azure AD portal blade. We won't be covering Azure AD much in this series, but you can read more detail about Azure AD logs here.

  • Diagnostic logs are published by individual Azure services. They provide information about the actions and work that the service itself is doing. By default these are not usually available for interactive querying. Diagnostic logs often work quite differently between different services. For example, Azure Storage can publish its own internal logs into a $logs blob container; App Services provides web server and application logs and can save these to a number of different places as well as view them in real time; and Azure SQL logs provide a lot of optional diagnostic information and again have to be explicitly enabled.

  • Application logs are written by application developers. These can be sent to a number of different places, but a common destination is Application Insights. If logs are published into Application Insights they can be queried interactively, and used as part of alerts and dashboards. We'll discuss these in more detail in later parts of this series.

Azure Log Analytics is a central log consolidation, aggregation, and querying service. Some of the above logs are published automatically into Log Analytics, while others have to be configured to do so. Log Analytics isn't a free service, and needs to be provisioned separately if you want to configure logs to be sent into it. We'll discuss it in more detail throughout this series.

Ingestion of Telemetry

Azure services automatically publish metrics into Azure Monitor, and these built-in metrics are ingested free of charge. Custom metrics can also be ingested by Azure Monitor, which we'll discuss in more detail in part 3 of this series.

As described in the previous section, different types of logs are ingested in different ways. Azure Monitor automatically ingests resource activity logs, and does so free of charge. The other types of logs are not ingested by Azure Monitor unless you explicitly opt into that, either by configuring Application Insights to receive custom logs, or by provisioning a Log Analytics workspace and then configuring your various components to send their logs to that.

Processing and Working With Telemetry

Once data has been ingested into Azure Monitor, it becomes available for a variety of different purposes. Many of these will be discussed in later parts of this series. For example, metrics can be used for dashboards (see part 6, coming soon) and for autoscale rules (see part 8, coming soon); logs that have been routed to Azure Monitor can be used as part of alerts (see part 5, coming soon); and all of the data can be exported (see part 9, coming soon).

Application Insights

Application Insights has been part of the Azure platform for around two years. Microsoft recently announced that it is considered to be part of the umbrella Azure Monitor service. However, Application Insights is deployed as a separate service, and is billable based on the amount of data it ingests. We'll cover Application Insights in more detail in part 2 of this series.

Summary of Instrumentation Components

There's a lot to take in here! The instrumentation story across Azure isn't always easy to understand, and although the complexity is reducing as Microsoft consolidates more and more of these services into Azure Monitor, there is still a lot to unpack. Here's a very brief summary:

Azure Monitor is the primary instrumentation service we generally interact with. Azure Monitor captures metrics from every Azure service, and it also captures some types of logs. More detailed diagnostic and activity logging can be enabled on a per-service or per-application basis, and depending on how you configure it, it may be routed to Azure Monitor or somewhere else like an Azure Storage account.

Custom data can be published into Azure Monitor through custom metrics (which we'll cover in part 3 of the series), through publishing custom logs into Log Analytics, and through Application Insights. Application Insights is a component that is deployed separately, and provides even more metrics and logging capabilities. It's built off the same infrastructure as the rest of Azure Monitor and is mostly queryable from the same places.

Once telemetry is published into Azure Monitor it's available for a range of different purposes including interactive querying, alerting, dashboarding, and exporting. We'll cover all of these in more detail throughout the series.

Instrumentation as Infrastructure

The idea of automating all of our infrastructure - scripting the setup of virtual machines or App Services, creating databases, applying schema updates, deploying our applications, and so forth - has become fairly uncontroversial. The benefits are so compelling, and the tools are getting so good, that generally most teams don't take much convincing that expressing their infrastructure as code is worthwhile. But in my experience working with a variety of customers, I've found that this often isn't the case with instrumentation.

Instrumentation components like dashboards, alerts, and availability tests are still frequently seen as being of a different category to the rest of an application. While it may seem perfectly reasonable to script out the creation of some compute resources, and for these scripts to be put into a version control system and built alongside the app itself, instrumentation is frequently handled manually and without the same level of automation rigour as the application code and scripts. As I'll describe below, I'm not opposed to using the Azure Portal and other similar tools to explore the metrics and logs associated with an application. But I believe that the instrumentation artifacts that come out of this exploration - saved queries, dashboard widgets, alert rules, etc - are just as important as the rest of our application components, and should be treated with the same level of diligence.

As with any other type of infrastructure, there are some clear benefits to expressing instrumentation components as code compared to using the Azure Portal including:

  • Reducing risk of accidental mistakes: I find that expressing my instrumentation logic explicitly in code, scripts, or ARM templates makes me far less likely to make a typo, or to do something silly like confuse different units of measurement when I'm setting an alert threshold.

  • Peer review: For teams that use a peer review process in their version control system, treating infrastructure as code means that someone else on the team is expected to review the changes I'm making. If I do end up making a dumb mistake then it's almost always caught by a coworker during a review, and even if there are no mistakes, having someone else on the team review the change means that someone else understands what's going on.

  • Version control: Keeping all of our instrumentation logic and alert rules in a version control system is helpful when we want to understand how instrumentation has evolved over time, and for auditability.

  • Keeping related changes together: I'm a big fan of keeping related changes together. For example, if I create a pull request to add a new application component then I can add the application code, the deployment logic, and the instrumentation for that new component all together. This makes it easier to understand the end-to-end scope of the feature being added. If we include instrumentation in our 'definition of done' for a feature then we can easily see that this requirement is met during the code review stage.

  • Managing multiple environments: When instrumentation rules and components aren't automated, it's easy for them to get out of sync between environments. In most applications there is at least one dev/test environment as well as production. While it might seem unnecessary to have alerts and monitoring in a dev environment, I will argue in part 2 of this series that it's important to do so, even if you have slightly different rules and thresholds. Deploying instrumentation as code means that these environments can be kept in sync. Similarly, you may deploy your production environment to multiple regions for georedundancy or for performance reasons. If your instrumentation components are kept alongside the rest of your infrastructure, you'll get the same alerts and monitoring for all of your regions.

  • Avoid partial automation: In my experience, partially automating an application can sometimes result in more complexity than not automating it at all. For example, if you use ARM templates and (as I typically suggest) use the 'complete' deployment mode, then any components you may have created manually through the Azure Portal can be removed. Many of the instrumentation components we'll discuss are ARM resources and so can be subject to this behaviour. Therefore, a lack of consistency across how we deploy all of our infrastructure and instrumentation can result in lost work, missed alerts, hard-to-find bugs, and generally odd instrumentation behaviour.

Using the Azure Portal

Having an instrumentation-first mindset doesn't mean that we can't or shouldn't ever use the Azure Portal. In fact, I tend to use it quite a lot - but for specific purposes.

First, I tend to use it a lot for interactively querying metrics and logs in response to an issue, or just to understand how my systems are behaving. I'll use Metrics Explorer to create and view charts of potentially interesting metrics, and I'll write log queries and execute them from Application Insights or Log Analytics.

Second, when I'm responding to alerts, I'll make use of the portal's tooling to view details, track the status of the alert, and investigate what might be happening. We'll discuss alerts more later in this series.

Third, I use the portal for monitoring my dashboards. We'll talk about dashboards in part 6 (coming soon). Once they're created, I'll often check on them to make sure that all of my metrics look to be in a normal range and that everything appears healthy.

Fourth, when I'm developing new alerts, dashboard widgets, or other components, I'll create test resources using the portal. I'll use my existing automation scripts to deploy a short-term copy of my environment temporarily, then deploy a new alert or autoscale rule using the portal, and then export them to an ARM template or manually construct a template based on what gets deployed by the portal. This way I can see how things should work, and get to use the portal's built-in validation and assistance with creating the components, but still get everything into code form eventually. Many of the ARM templates I'll provide throughout this series were created in this way.

Finally, during an emergency - when a system is down, or something goes wrong in the middle of the night - I'll sometimes drop the automation-first requirement and create alerts on the fly, even on production, but knowing that I'll need to make sure I add it into the automation scripts as soon as possible to ensure everything stays in sync.


This post has outlined the basics of Azure's instrumentation platform. The two main types of data we tend to work with are metrics and logs. Metrics are numerical values that represent the state of a system at a particular point in time. Logs come in several variants, some of which are published automatically and some of which need to be enabled and then published to a suitable location before they can be queried. Both metrics and logs can be processed by Azure Monitor, and over the course of this series we'll look at how we can script and automate the ingestion, processing, and handling of a variety of types of instrumentation data.

Automation of Azure Monitor and other instrumentation components is something that I've found to be quite poorly documented, so in writing this series I've aimed to provide both explanations of how these parts can be built, and a set of sample ARM templates and scripts that you can adapt to your own environment.

In the next part we'll discuss Application Insights, and some of the automation we can achieve with that. We'll also look at a pattern I typically use for deploying different levels of instrumentation into different environments.

Integration Testing Precompiled v2 Azure Functions

This post was originally published on the Kloud blog.

Azure Functions code can often contain important functionality that needs to be tested. The two most common ways of testing code are unit testing and integration testing. Unit testing runs pieces of code in isolation, and this is relatively simple to do with Azure Functions. Integration testing can be a little trickier though, and I haven't found any good documentation about how to do this with version 2 of the Functions runtime. In this post I'll outline the approach I'm using to run integration tests against my Azure Functions v2 code.

In an application with a lot of business logic, unit testing may get us most of the way to verifying the code's quality. But Azure Functions code often involves pieces of functionality that can't be easily unit tested. For example, triggers and input and output bindings are very powerful features that let us avoid writing boilerplate code to bind to HTTP requests, connect to Azure Storage and Service Bus blobs, queues, and tables, or build our own timer logic. Similarly, we may need to connect to external services or databases, or work with libraries that can't be easily mocked or faked. If we want to test these parts of our Functions apps then we need some form of integration testing.

Approaches to Integration Testing

Integration tests involve running code in as close to a real environment as is practicable. The tests are generally run from our development machines or build servers. For example, ASP.NET Core lets us host an in-memory server for our application, which we can then connect to real databases, in-memory versions of systems like the Entity Framework, or emulators for services like Azure Storage and Cosmos DB.

Azure Functions v1 included some features to support integration testing of script-based Functions apps. But to date, I haven't found any guidance on how to run integration tests using a precompiled .NET Azure Functions app running against the v2 runtime.

Example Functions App

For the purposes of this post I've written a very simple Functions App with two functions that illustrate two common use cases. One function (HelloWorld) receives an HTTP message and returns a response, and the second (HelloQueue) receives an HTTP message and writes a message to a queue. The actual functions themselves are really just simple placeholders based on the Visual Studio starter function template:

In a real application you're likely to have a lot more going on than just writing to a queue, but the techniques below can be adapted to cover a range of different scenarios.

You can access all the code for this blog post on GitHub.

Implementing Integration Tests

The Azure Functions v2 core tools support local development and testing of Azure Functions apps. One of the components in this set of tools is func.dll, which lets us host the Azure Functions runtime. By automating this from our integration test project we can start our Functions app in a realistic host environment, run our tests, and tear it down again. This is ideal for running a suite of integration tests.

While you could use any test framework you like, the sample implementation I've provided uses xUnit.

Test Collection and Fixture

The xUnit framework provides a feature called collection fixtures. These fixtures let us group tests together; they also let us run initialisation code before the first test runs, and run teardown code after the last test finishes. Here's a placeholder test collection and fixture that will support our integration tests:

Starting the Functions Host

Now we have the fixture class definition, we can use it to start and stop the Azure Functions host. We will make use of the System.Diagnostics.Process class to start and stop the .NET Core CLI (dotnet.exe), which in turn starts the Azure Functions host through the func.dll library.

Note: I assume you already have the Azure Functions v2 core tools installed. You may already have these if you've got Visual Studio installed with the Azure Functions feature enabled. If not, you can install them into your global NPM packages by using the command npm install -g azure-functions-core-tools, as per the documentation here.

Our fixture code looks like this:

The code is fairly self-explanatory: during initialisation it reads various paths from configuration settings and starts the process; during teardown it kills the process and disposes of the Process object.

I haven't added any Task.Delay code, or anything to poll the function app to check if it's ready to receive requests. I haven't found this to be necessary. However, if you find that the first test run in a batch fails, this might be something you want to consider adding at the end of the fixture initialisation.
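If you do need a readiness check, the idea is simply to poll until the host answers or a timeout passes. The sample fixture is written in C#, so the following is only a language-neutral sketch in Python rather than code from the sample project. The names wait_until_ready and http_probe are hypothetical, and the URL you probe would depend on how your host is started.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_seconds: float = 30.0,
    interval_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection errors just mean the host is not listening yet.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


def http_probe(url: str) -> bool:
    """Hypothetical probe: any response below HTTP 500 counts as 'ready'."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # The host answered, even if with a 4xx status.
        return error.code < 500
    except OSError:
        return False
```

A fixture could call wait_until_ready(lambda: http_probe(url)) at the end of its initialisation and fail fast if it returns False. Injecting sleep and clock keeps the helper itself easy to test without real delays.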


Some of the code in the above fixture file won't compile yet because we have some configuration settings that need to be passed through. Specifically, we need to know the path to the dotnet.exe file (the .NET Core CLI), the func.dll file (the Azure Functions host), and to our Azure Functions app binaries.

I've created a class called ConfigurationHelper that initialises a static property that will help us:

The Settings class is then defined with the configuration settings we need:

Then we can create an appsettings.json file to set the settings:

That's all we need to run the tests. Now we can write a couple of actual test classes.

Writing Some Tests

You can write your integration tests in any framework you want, whether that be something very simple like pure xUnit tests or a more BDD-oriented framework like SpecFlow.

Personally I like BDDfy as a middle ground - it's simple and doesn't require a lot of extra plumbing like SpecFlow, while letting us write BDD-style tests in pure code.

Here are a couple of example integration tests I've written for the sample app:

Test the HelloWorld Function

This test simply calls the HTTP-triggered HelloWorld function and checks the output is as expected.

Test the HelloQueue Function

The second test checks that the HelloQueue function posts to a queue correctly. It does this by clearing the queue before it runs, letting the HelloQueue function run, and then confirming that a single message - with the expected contents - has been enqueued.
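The shape of that second test (clear, act, then assert on exactly one message) can be sketched independently of any particular queue client. The following Python sketch uses a hypothetical in-memory queue as a stand-in for Azure Storage, and the message text is made up for illustration. It is not the sample project's test code.

```python
from collections import deque
from typing import Callable, Deque, List


class InMemoryQueue:
    """Hypothetical stand-in for a real storage queue client."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def clear(self) -> None:
        self._messages.clear()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive_all(self) -> List[str]:
        messages = list(self._messages)
        self._messages.clear()
        return messages


def run_queue_scenario(queue: InMemoryQueue, invoke_function: Callable[[], None]) -> List[str]:
    """Clear the queue, trigger the function, and return whatever was enqueued."""
    queue.clear()              # Given: an empty queue
    invoke_function()          # When: the function under test runs
    return queue.receive_all() # Then: inspect the resulting messages


def assert_single_message(messages: List[str], expected: str) -> None:
    assert len(messages) == 1, f"expected 1 message, found {len(messages)}"
    assert messages[0] == expected
```

Clearing the queue first matters because integration tests share real (or emulated) storage, so leftovers from an earlier run would otherwise make the count assertion unreliable.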

Running the Tests

Now we can compile the integration test project and run it from Visual Studio's Test Explorer. Behind the scenes it runs the .NET Core CLI, starts the Azure Functions host, executes our tests, and then kills the host when they're finished. And we can see the tests pass!

Running from Azure Pipelines

Getting the tests running from our local development environment is great, but integration tests are most useful when they run automatically as part of our continuous integration process. (Sometimes integration tests take so long to run that they get relegated to run on nightly builds instead, but that's a topic that's outside the scope of this post.)

Most build servers should be able to run our integration tests without any major problems. I'll use the example here of Azure Pipelines, which is part of the Azure DevOps service (and which used to be called VSTS's Build Configurations feature). Azure Pipelines lets us define our build process as a YAML file, which is also a very convenient way to document and share it!

Here's a build.yaml for building our Azure Functions app and running the integration tests:

The four key parts here are:

  • Lines 3 and 4 override the FunctionHostPath app setting with the location that Azure Pipelines hosted agents use for NPM packages, which is different to the location on most developers' PCs.

  • Line 6 links the build.yaml with the variable group IntegrationTestConnectionStrings. Variable groups are outside the scope of this post, but briefly, they let us create a predefined set of variables that are available as environment variables. Inside the IntegrationTestConnectionStrings variable group, I have set two variables - AzureWebJobsStorage and StorageConnectionString - to a connection string for an Azure Storage account that I want to use when I run from the hosted agent.

  • Lines 16 through 21 install the Azure Functions Core Tools, which gives us the func.dll host that we use. For Azure Pipelines hosted agents we need to run this step every time we run the build since NPM packages are reset after each build completes.

  • Lines 23 through 28 use the dotnet test command, which is part of the .NET Core tooling, to execute our integration tests. This automatically publishes the results to Azure DevOps too.
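The override described in the first bullet only works if the test project lets environment variables take precedence over the values in appsettings.json. That layering is an assumption on my part here, since the configuration helper isn't shown above. As a rough sketch of the idea in Python (load_settings is a hypothetical name, and only FunctionHostPath comes from this post):

```python
import json
import os
from typing import Dict, Mapping, Optional


def load_settings(json_text: str, environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read defaults from JSON, then let matching environment variables win."""
    settings = dict(json.loads(json_text))
    env = os.environ if environ is None else environ
    for key in settings:
        if key in env:
            # A build agent can override a developer-machine default here.
            settings[key] = env[key]
    return settings
```

On a developer machine nothing is overridden and the file's values are used. On a hosted agent, a pipeline variable named FunctionHostPath would replace the default path.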

When we run the build, we can see the tests have run successfully:


And not only that, but we can see the console output in the build log, which can be helpful when diagnosing any issues we might see:

I've found this approach to be really useful when developing complex Azure Functions apps, where integration testing is a part of the quality control necessary for functions that run non-trivial workloads.

Remember you can view the complete code for this post on GitHub.

Update: A second post, adapting this method to test timer-triggered functions, is now available too.

Creating Azure Storage SAS Tokens with ARM Templates

This post was originally published on the Kloud blog.

Shared access signatures, sometimes also called SAS tokens, allow for delegating access to a designated part of an Azure resource with a defined set of permissions. They can be used to allow various types of access to your Azure services while keeping your access keys secret.

In a recent update to Azure Resource Manager, Microsoft has added the ability to create SAS tokens from ARM templates. While this is a general-purpose feature that will hopefully work across a multitude of Azure services, for now it only seems to work with Azure Storage (at least of the services I've checked). In this post I'll explain why this is useful, and give some example ARM templates that illustrate creating both account and service SAS tokens.

Use Cases

There are a few situations where it's helpful to be able to create SAS tokens for an Azure Storage account from an ARM template. One example is when using the Run From Package feature - an ARM template can deploy an App Service, deploy a storage account, and create a SAS token for a package blob - even if it doesn't exist at deployment time.

Another example might be for an Azure Functions app. A very common scenario is for a function to receive a file as input, transform it in some way, and save it to blob storage. Rather than using the root access keys for the storage account, we could create a SAS token and add it to the Azure Functions app's config settings, like in this example ARM template.

How It Works

ARM templates now support a new set of functions for generating SAS tokens. For Azure Storage, there are two types of SAS tokens - account and service - and the listAccountSas and listServiceSas functions correspond to these, respectively.

Behind the scenes, these functions invoke the corresponding functions on the Azure Storage ARM provider API. Somewhat confusingly, even though the functions' names start with list, they are actually creating SAS tokens and not listing or working with previously created tokens. (In fact, SAS tokens are not resources that Azure can track, so there's nothing to list.)


In this post I'll show ARM templates that use a variety of types of SAS for different types of situations. For simplicity I've used the outputs section to emit the SASs, but these could easily be passed through to other parts of the template such as App Service app settings (as in this example). Also, in this post I've only dealt with blob storage SAS tokens, but the same process can easily be used for queues and tables, and can even let us restrict token holders to only access specific table partitions or sets of rows.

Creating a Service SAS

By using a service SAS, we can grant permissions to a specific blob in blob storage, or to all blobs within a container. Similarly we can grant permissions to an individual queue, or to a subset of entities within a table. More documentation on constructing service SASs is available here, and the details of the listServiceSas function are here.

Service SASs require us to provide a canonicalizedResource, which is just a way of describing the scope of the token. We can use the path to an individual blob by using the form /blob/, such as /blob/mystorageaccountname/images/cat.jpeg or /blob/mystorageaccountname/images/cats/fluffy.jpeg. Or, we can use the path to a container by using the form /blob/mystorageaccountname/images. The examples below show different types of canonicalizedResource property values.
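To make the two forms concrete, here is a small Python sketch that builds the same paths as the examples in the paragraph above. The function name is hypothetical, and in a real deployment this string is assembled inside the ARM template rather than in Python.

```python
from typing import Optional


def canonicalized_resource(account: str, container: str, blob: Optional[str] = None) -> str:
    """Build the blob-service scope path for a container or a single blob."""
    parts = ["", "blob", account, container]
    if blob is not None:
        # Blob names may contain '/' to simulate folders; drop only a leading one.
        parts.append(blob.lstrip("/"))
    return "/".join(parts)
```

Leaving out the blob name gives a container-scoped path, while supplying it narrows the scope to that one blob.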

Read Any Blob Within a Container

Our first SAS token will let us read all of the blobs within a given container. For this, we'll need to provide four properties.

First, we'll provide a canonicalizedResource. This will be the path to the container we want to allow the holder of the token to read from. We'll construct this field dynamically based on the ARM template input.

Second, we need to provide a signedResource. This is the type of resource that the token is scoped to. In this case, we're creating a token that works across a whole blob container and so we'll use the value c.

Third, we provide the signedPermission. This is the permission, or set of permissions, that the token will allow. There are different permissions for reading, creating, and deleting blobs and other storage entities. Importantly, listing is considered to be separate to reading, so bear this in mind too. Because we just want to allow reading blobs, we'll use the value r.

Finally, we have to provide an expiry date for the token. I've set this to January 1, 2050.

When we execute the ARM template with the listServiceSas function, we need to provide these values as an object. Here's what our object looks like:

"serviceSasFunctionValues": {
    "canonicalizedResource": "[concat('/blob/', parameters('storageAccountName'), '/', parameters('containerName'))]",
    "signedResource": "c",
    "signedPermission": "r",
    "signedExpiry": "2050-01-01T00:00:00Z"
}

And here's the ARM template - check out line 62, where the listServiceSas function is actually invoked.

Read A Single Blob

In many cases we will want to issue SAS tokens to only read a single blob, and not all blobs within a container. In this case we make a couple of changes from the first example.

First, our canonicalizedResource now will have the path to the individual blob, not to a container. Again, we'll pull these from the ARM template parameters.

Second, the signedResource is now a blob rather than a container, so we use the value b instead of c.

Here's our properties object for this SAS token:

"serviceSasFunctionValues": {
    "canonicalizedResource": "[concat('/blob/', parameters('storageAccountName'), '/', parameters('containerName'), '/', parameters('blobName'))]",
    "signedResource": "b",
    "signedPermission": "r",
    "signedExpiry": "2050-01-01T00:00:00Z"
}

And here's the full ARM template:

Write A New Blob

SAS tokens aren't just for reading, of course. We can also create a token that will allow creating or writing to blobs. One common use case is to allow the holder of a SAS to create a new blob, but not to overwrite anything that already exists. To create a SAS for this scenario, we work back at the container level again - so the canonicalizedResource property is set to the path to the container, and the signedResource is set to c. This time, we set signedPermission to c to allow for blobs to be created. (If we wanted to also allow overwriting blobs, we could do this by setting signedPermission to cw.)

Here's our properties object:

"serviceSasFunctionValues": {
  "canonicalizedResource": "[concat('/blob/', parameters('storageAccountName'), '/', parameters('containerName'))]",
  "signedResource": "c",
  "signedPermission": "c",
  "signedExpiry": "2050-01-01T00:00:00Z"
}
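The three service SAS objects in this section differ only in their scope, resource type, and permissions. As a sketch of that pattern, here is a hypothetical Python helper that produces the same property names and rejects obviously invalid combinations. The accepted permission letters are a deliberately limited subset chosen for this sketch, and the helper is not part of any Azure SDK.

```python
from datetime import datetime, timezone
from typing import Dict

# 'b' scopes the token to a single blob, 'c' to a whole container.
SIGNED_RESOURCES = {"b", "c"}
# Subset of blob permission letters accepted by this sketch:
# read, add, create, write, delete, list.
PERMISSION_LETTERS = set("racwdl")


def service_sas_values(resource_path: str, signed_resource: str,
                       signed_permission: str, expiry: datetime) -> Dict[str, str]:
    """Assemble the values object described above for a blob service SAS."""
    if signed_resource not in SIGNED_RESOURCES:
        raise ValueError("signed_resource must be 'b' or 'c'")
    if not signed_permission or set(signed_permission) - PERMISSION_LETTERS:
        raise ValueError("unsupported permission letters: " + signed_permission)
    if expiry.tzinfo is None:
        raise ValueError("expiry must be timezone-aware")
    return {
        "canonicalizedResource": resource_path,
        "signedResource": signed_resource,
        "signedPermission": signed_permission,
        # Expiry is emitted as a UTC timestamp, matching the examples above.
        "signedExpiry": expiry.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Validating up front is mostly a convenience: a typo in a permission string would otherwise only surface when the deployment runs.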

And here's an ARM template:

Creating an Account SAS

An account SAS works at the level of a storage account, rather than at the item-level like a service SAS. For the majority of situations you will probably want a service SAS, but account SASs can be used for situations where you want to allow access to all blobs within a storage account, or if you want to allow for the management of blob containers, tables, and queues. More detail on what an account SAS can do is available here.

More documentation on constructing account SASs is available here, and the details of the listAccountSas function are here.

Read All Blobs in Account

We can use an account SAS to let us read all of the blobs within a storage account, regardless of the container they're in. For this token we need to use signedServices = b to indicate that we're granting permissions within blob storage, and we'll use signedPermission = r to let us read blobs.

The signedResourceTypes parameter is available on account SAS tokens but not on service SAS tokens, and it lets us specify the set of APIs that can be used. We'll use o here since we want to read all blobs, and blobs are considered to be objects, and reading blobs would involve object-level API calls. There are also two other values for this parameter - s indicates service-level APIs (such as creating new blob containers), and c indicates container-level APIs (such as deleting a blob container that already exists). You can see the APIs available within each category in the Azure Storage documentation.
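The three resource type codes are easy to mix up, so a lookup table can help. The Python sketch below restates the meanings given in the paragraph above and wraps them in a hypothetical helper that builds the account SAS values. It only accepts the blob service ('b'), since that is all this post covers.

```python
from typing import Dict

# Meanings as described above; each letter selects a category of storage APIs.
RESOURCE_TYPES = {
    "s": "service-level APIs (for example, creating new blob containers)",
    "c": "container-level APIs (for example, deleting an existing container)",
    "o": "object-level APIs (for example, reading a blob)",
}


def account_sas_values(signed_services: str, signed_permission: str,
                       signed_resource_types: str, signed_expiry: str) -> Dict[str, str]:
    """Assemble the values object for an account SAS (blob service only here)."""
    if signed_services != "b":
        raise ValueError("this sketch only handles the blob service ('b')")
    unknown = set(signed_resource_types) - set(RESOURCE_TYPES)
    if not signed_resource_types or unknown:
        raise ValueError("unknown resource type letters: " + signed_resource_types)
    return {
        "signedServices": signed_services,
        "signedPermission": signed_permission,
        "signedResourceTypes": signed_resource_types,
        "signedExpiry": signed_expiry,
    }
```

For example, account_sas_values("b", "r", "o", "2050-01-01T00:00:00Z") describes a token that can read any blob in the account.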

So our SAS token will be generated with the following properties:

"accountSasFunctionValues": {
    "signedServices": "b",
    "signedPermission": "r",
    "signedResourceTypes": "o",
    "signedExpiry": "2050-01-01T00:00:00Z"
}

Here's the full ARM template that creates this SAS:

List Blob Containers in Account

Finally, let's create a SAS token that will allow us to list the blob containers within our account. For this token we need to use signedServices = b again, but this time we'll use signedPermission = l (since l indicates permission to list).

Somewhat non-intuitively, the API that lists containers is considered a service-level API, so we need to use signedResourceTypes = s.

This means the parameters we're using to generate a SAS token are as follows:

"accountSasFunctionValues": {
    "signedServices": "b",
    "signedPermission": "l",
    "signedResourceTypes": "s",
    "signedExpiry": "2050-01-01T00:00:00Z"
}

Here's the full ARM template that creates this SAS:

You can test this by executing a GET request against this URL: https://.

Other Services

Currently it doesn't appear that other services that use SAS tokens support this feature. For example, Service Bus namespaces, Event Hubs namespaces, and Cosmos DB database accounts don't support any list* operations on their resource provider APIs that would allow for creating SAS tokens. Hopefully these will come soon, since this feature is very powerful and will allow for ARM templates to be more self-contained.

Also, I have noticed that some services don't like the listAccountSas function being embedded in their resource definitions. For example, Azure Scheduler (admittedly soon to be retired) seems to have a bug where it doesn't like SAS tokens generated in this way for scheduler jobs. However, I have used this feature in other situations without any such issues.

Deploying App Services with 'Run From Package', Azure Storage, and Azure Pipelines

This post was originally published on the Kloud blog.

Azure App Service recently introduced a feature called Run From Package. Rather than uploading our application binaries and other files to an App Service directly, we can instead package them into a zip file and provide App Services with the URL. This is a useful feature because it eliminates issues with file locking during deployments, it allows for atomic updates of application code, and it reduces the time required to boot an application. It also means that the 'release' of an application simply involves the deployment of a configuration setting. And because Azure Functions runs on top of App Services, the same technique can be used for Azure Functions too.

While we can store the packages anywhere on the internet, and then provide a URL to App Services to find them, a common approach is to use Azure Storage blobs to host the packages. Azure Storage is a relatively cheap and easy way to host files and get URLs to them, making it perfect for this type of deployment. Additionally, by permanently storing our packages in an Azure Storage account we can keep a permanent record of all deployments - and we can even use features of Azure Storage like immutable blobs to ensure that the blobs can't be tampered with or deleted.

However there are a few different considerations that it's important to think through when using storage accounts in conjunction with App Services. In this post I'll describe one way to use this feature with Azure DevOps build and release pipelines, and some of the pros and cons of this approach.

Storage Account Considerations

When provisioning an Azure Storage account to contain your application packages, there are a few things you should consider.

First, I'd strongly recommend using SAS tokens in conjunction with a container access policy to ensure that your packages can't be accessed by anyone who shouldn't access them. Typically application packages that are destined for an App Service aren't files you want to be made available to everyone on the internet.

Second, consider the replication options you have for your storage account. For something that runs an App Service I'd generally recommend using the RA-GRS replication type to ensure that even if Azure Storage has a regional outage App Services can still access the packages. This needs to be considered as part of a wider disaster recovery strategy though, and also remember that in the event that the primary region for your Azure Storage account is unavailable, you need to do some manual work to switch your App Service to read your package from the secondary region.

Third, virtual network integration is not currently possible for storage accounts that are used by App Services, although it should be soon. Today, Azure allows joining an App Service into a virtual network, but service endpoints - the feature that allows blocking access to Azure Storage outside of a virtual network - aren't supported by App Services yet. This feature is in preview, though, so I'm hopeful that we'll be able to lock down the storage accounts used by App Services packages soon.

Fourth, consider who deploys the storage account(s), and how many you need. In most cases, having a single storage account per App Service is going to be too much overhead to maintain. But if you decide to share a single storage account across many App Services throughout your organisation, you'll need to consider how to share the keys to generate SAS tokens.

Fifth, consider the lifetime of the application versus the storage account. In Azure, resource groups provide a natural boundary for resources that share a common lifetime. An application and all of its components would typically be deployed into a single resource group. If we're dealing with a storage account that may be used across multiple applications, it would be best to have the storage account in its own resource group so that decommissioning one application won't result in the storage account being removed, and so that Azure RBAC permissions can be separately assigned.

Using Azure Pipelines

Azure Pipelines is the new term for what used to be called VSTS's build and release management features. Pipelines let us define the steps involved in building, packaging, and deploying our applications.

There are many different ways we can create an App Service package from Azure Pipelines, upload it to Azure Storage, and then deploy it. Each option has its own pros and cons, and the choice will often depend on how pure you want your build and release processes to be. For example, I prefer that my build processes are completely standalone, and don't interact with Azure at all if possible. They emit artifacts and the release process then manages the communication with Azure to deploy them. Others may not be as concerned about this separation, though.

I also insist on 100% automation for all of my build and release tasks, and vastly prefer to script things out in text files and commit them to a version control system like Git, rather than changing options on a web portal. Therefore I typically use a combination of build YAMLs, ARM templates, and PowerShell scripts to build and deploy applications.

It's fairly straightforward to automate the creation, upload, and deployment of App Service packages using these technologies. In this post I'll describe one approach that I use to deploy an App Service package, but towards the end I'll outline some variations you might want to consider.


I'll use the hypothetical example of an ASP.NET web application. I'm not going to deploy any actual real application here - just the placeholder code that you get from Visual Studio when creating a new ASP.NET MVC app - but this could easily be replaced with a real application. Similarly, you could replace this with a Node.js or PHP app, or any other language and framework supported by App Service.

You can access the entire repository on GitHub here, and I've included the relevant pieces below.

Step 1: Deploy Storage Account

The first step is to deploy a storage account to contain the packages. My preference is for a storage account to be created separately to the build and release process. As I noted above, this helps with reusability - that account can be used for any number of applications you want to deploy - and it also ensures that you're not tying the lifetime of your storage account to the lifetime of your application.

I've provided an ARM template below that deploys a storage account with a unique name, and creates a blob container called packages that is used to store all of our packages. The template also outputs the connection details necessary to upload the blobs and generate a SAS token in later steps.

Step 2: Build Definition

Our Azure Pipelines build definition is pretty straightforward. All we do here is build our app and publish a build artifact. In Azure Pipelines, 'publishing' a build artifact means that it's available for releases to use - the build process doesn't actually save the package to the Azure Storage account. I've used a build YAML file to define the process, which is provided here:

Step 3: Release Definition

Our release definition runs through the remaining steps. I've provided a PowerShell script that executes the release process:

First, it uploads the build artifacts to our storage account's packages container. It generates a unique filename for each blob, ensuring that every release is fully independent and won't accidentally overwrite another release's package.
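The release script itself is PowerShell, but the unique-name idea is simple enough to sketch in any language. Here is one possible Python version, in which the prefix and the timestamp-plus-GUID format are my own assumptions rather than the format the script uses:

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def unique_package_name(prefix: str = "package", now: Optional[datetime] = None) -> str:
    """Return a blob name that will not collide with earlier releases."""
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%d%H%M%S")
    # The random GUID guarantees uniqueness even for releases in the same second;
    # the timestamp just makes the blobs easier to browse in order.
    return f"{prefix}-{stamp}-{uuid.uuid4().hex}.zip"
```

Because every call yields a new name, a re-run of the same release uploads a fresh blob instead of overwriting one that a running App Service may still be reading.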

Next, it generates a SAS token for the packages it's uploaded. The token just needs to provide read access to the blob. I've used a 100-year expiry, but you could shorten this if you need to - just make sure you don't make it too short, or App Services won't be able to boot your application once the expiry date passes.
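Computing a far-future expiry has one small trap: adding whole years to 29 February can land on a date that doesn't exist. The following Python sketch (with a hypothetical function name, and not the PowerShell used by the release script) produces a UTC timestamp in the same format as the signedExpiry values used for SAS tokens:

```python
from datetime import datetime, timezone


def expiry_years_from(start: datetime, years: int = 100) -> str:
    """Return a UTC expiry timestamp a whole number of years after start."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    try:
        expiry = start.replace(year=start.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        expiry = start.replace(year=start.year + years, day=28)
    return expiry.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Shortening the lifetime is just a matter of passing a smaller years value, keeping in mind the warning above about the app failing to boot once the token expires.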

Finally, it deploys the App Service instance using an ARM template, and passes the full URL to the package - including its SAS token - into the ARM deployment parameters. The key part for the purposes of this post is on line 51 of the template, where we create a new app setting called WEBSITE_RUN_FROM_PACKAGE and set it to the full blob URL, including the SAS token part. Here's the ARM template we execute from the script:

Note that if you want to use this PowerShell script in your own release process, you'll want to adjust the variables so that you're using the correct source alias name for the artifact, as well as the correct resource group name.
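To make the final step concrete, the value placed in WEBSITE_RUN_FROM_PACKAGE is just the blob URL with the SAS token appended as a query string. The Python sketch below shows that composition. It assumes the public Azure cloud blob endpoint (blob.core.windows.net), and the function names are hypothetical.

```python
from typing import Dict
from urllib.parse import quote


def package_url(account: str, container: str, blob_name: str, sas_token: str) -> str:
    """Join the blob location and SAS token into a single readable URL."""
    # Some tools return the token with a leading '?', others without.
    token = sas_token.lstrip("?")
    # Encode unusual characters in the blob name but keep '/' as a path separator.
    path = quote(blob_name, safe="/")
    return f"https://{account}.blob.core.windows.net/{container}/{path}?{token}"


def run_from_package_setting(url: str) -> Dict[str, str]:
    """The single app setting that points App Service at the package."""
    return {"WEBSITE_RUN_FROM_PACKAGE": url}
```

Since the whole release boils down to changing this one setting, rolling back to an earlier package is a matter of redeploying with the previous URL.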

Pros and Cons

The approach I've outlined here has a few benefits: it allows for a storage account to be shared across multiple applications; it keeps the build process clean and simple, and doesn't require the build to interact with Azure Storage; and it ensures that each release runs independently, uploading its own private copy of a package.

However there are a few problems with it. First, the storage account credentials need to be available to the release definition. This may not be desirable if the account is shared by multiple teams or multiple applications. Second, while having independent copies of each package is useful, it also means there's some wasted space if we deploy a single app multiple times.

If these are concerns to you, there are a number of things you could consider, depending on your concern.

If your concern is that credentials are being shared, then you could consider creating a dedicated storage account as part of the release process. The release process can provision the storage account (if it doesn't already exist), retrieve the keys to it, upload the package, generate a SAS token, and then deploy the App Service with the URL to the package. The storage account's credentials would never leave the release process. Of course, this also makes it harder to share the storage account across multiple applications.

Keeping the storage account with the application also makes the release more complicated, since you can no longer deploy everything in a single ARM template deployment operation. You'd need at least two ARM deployments, with some scripting required in between. The first ARM template deployment would deploy the storage account and container. You'd then execute some PowerShell or another script to upload the package and generate a SAS token. Then you could execute a second ARM template deployment to deploy the App Service and point it to the package URL.

Another alternative is to pre-create SAS tokens for your deployments to use. One SAS token would be used for the upload of the blobs (and would therefore need write permissions assigned), while a second would be used for the App Service to access all blobs within the container (and would only need read permissions assigned).

Yet another alternative is to use the preview Azure RBAC feature of Azure Storage to authenticate the release process to the storage account. This is outside the scope of this post, but this approach could be used to delegate permissions to the storage account without sharing any account keys.

If your concern is that the packages may be duplicated, you have a few options. One is to simply not create unique names during each release, but instead use a naming scheme that results in consistent names for the same build artifacts. For example, you might use the convention .zip. Subsequent releases could check if the package already exists and leave it alone if it does.
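One way to get consistent names is to derive them from the package contents, so that identical artifacts always map to the same blob. This is only one possible scheme, sketched in Python, and not necessarily the convention described above. A plain dictionary stands in for the blob container.

```python
import hashlib
from typing import Dict


def content_addressed_name(package_bytes: bytes) -> str:
    """Name the package after a SHA-256 hash of its contents."""
    return hashlib.sha256(package_bytes).hexdigest() + ".zip"


def upload_if_absent(container: Dict[str, bytes], name: str, package_bytes: bytes) -> bool:
    """Upload only when no blob with this name exists; return True if uploaded."""
    if name in container:
        # Same name means same contents, so the existing blob can be reused.
        return False
    container[name] = package_bytes
    return True
```

The trade-off is that releases are no longer fully independent: two releases of the same build now share one blob, so deleting it affects both.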

If you don't want to use Azure Storage at all, you can also upload a package directly to the App Service's d:\home\data\SitePackages folder. This way you gain some of the benefits of the Run From Package feature - namely the speed and atomicity of deployments - but lose the advantage of having a simpler deployment with immutable blobs. This is documented on the official documentation page. Also, you can of course use any file storage system you like, such as Amazon S3, to host your packages.

Also, bear in mind that App Services on Linux don't currently support the Run From Package feature at all.

Automatic Key Rotation for Azure Services

This post was originally published on the Kloud blog.

Securely managing keys for services that we use is an important, and sometimes difficult, part of building and running a cloud-based application. In general I prefer not to handle keys at all, and instead rely on approaches like managed service identities with role-based access control, which allow for applications to authenticate and authorise themselves without any keys being explicitly exchanged. However, there are a number of situations where we do need to use and manage keys, such as when we use services that don't support role-based access control. One best practice that we should adopt when handling keys is to rotate (change) them regularly.

Key rotation is important to cover situations where your keys may have been compromised. Common attack vectors include keys having been committed to a public GitHub repository, a log file having a key accidentally written to it, or a disgruntled ex-employee retaining a key that had previously been issued. Changing the keys regularly limits the scope of the damage; if keys aren't changed regularly then these types of vulnerability can be severe.

In many applications, keys are used in complex ways and require manual intervention to rotate. But in other applications, it's possible to completely automate the rotation of keys. In this post I'll explain one such approach, which rotates keys every time the application and its infrastructure components are redeployed. Assuming the application is deployed regularly, for example using a continuous deployment process, we will end up rotating keys very frequently.


The key rotation process I describe here relies on the fact that the services we'll be dealing with - Azure Storage, Cosmos DB, and Service Bus - have both a primary and a secondary key. Both keys are valid for any requests, and they can be changed independently of each other. During each release we will pick one of these keys to use, and we'll make sure that we only use that one. We'll deploy our application components, which will include referencing that key and making sure our application uses it. Then we'll rotate the other key.

The flow of the script is as follows:

  1. Decide whether to use the primary key or the secondary key for this deployment. There are several approaches to do this, which I describe below.

  2. Deploy the ARM template. In our example, the ARM template is the main thing that reads the keys. The template copies the keys into an Azure Function application's configuration settings, as well as into a Key Vault. You could, of course, output the keys and have your deployment script put them elsewhere if you want to.

  3. Run the other deployment logic. For our simple application we don't need to do anything more than run the ARM template deployment, but for many deployments you might copy your application files to a server, swap the deployment slots, or perform a variety of other actions that you need to run as part of your release.

  4. Test the application is working. The Azure Function in our example will perform some checks to ensure the keys are working correctly. You might also run other 'smoke tests' after completing your deployment logic.

  5. Record the key we used. We need to keep track of the keys we’ve used in this deployment so that the next deployment can use the other one.

  6. Rotate the other key. Now we can rotate the key that we are not using. The way that we rotate keys is a little different for each service.

  7. Test the application again. Finally, we run one more check to ensure that our application works. This is mostly a last check to ensure that we haven't accidentally referenced any other keys, which would break our application now that they've been rotated.

We don't rotate any keys until after we've already switched the application to using the other set of keys, so we should never end up in a situation where we've referenced the wrong keys from the Azure Functions application. However, if we wanted to have a true zero-downtime deployment then we could use something like deployment slots to allow for warming up our application before we switch it into production.
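
The seven steps above can be sketched as a small orchestration function. This is a Python outline rather than the PowerShell used by the sample scripts, and every callable passed in (`choose_key_set`, `deploy`, `smoke_test`, `record_key_set`, `rotate`) is a hypothetical placeholder for your own deployment logic.

```python
from typing import Callable


def other(key_set: str) -> str:
    """Return the key set that is NOT the one given."""
    return "secondary" if key_set == "primary" else "primary"


def run_release(
    choose_key_set: Callable[[], str],
    deploy: Callable[[str], None],
    smoke_test: Callable[[], bool],
    record_key_set: Callable[[str], None],
    rotate: Callable[[str], None],
) -> str:
    key_set = choose_key_set()   # Step 1: primary or secondary?
    deploy(key_set)              # Steps 2-3: ARM template and other deployment logic
    if not smoke_test():         # Step 4: check the chosen keys work
        raise RuntimeError("Smoke test failed; no keys have been rotated.")
    record_key_set(key_set)      # Step 5: remember which keys are in use
    rotate(other(key_set))       # Step 6: rotate only the keys we are NOT using
    if not smoke_test():         # Step 7: make sure nothing referenced the rotated keys
        raise RuntimeError("Smoke test failed after rotation.")
    return key_set
```

The ordering is the important part: the unused keys are only rotated once the application has been deployed and verified against the other set.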

A Word of Warning

If you're going to apply the principle in this post, or the code below, to your own applications, it's important to be aware of a significant limitation. The particular approach described here only works if your deployments are completely self-contained, with the keys only used inside the deployment process itself. If you provide keys for your components to any other systems or third parties, rotating keys in this manner will likely cause their systems to break.

Importantly, any shared access signatures and tokens you issue will likely be broken by this process too. For example, if you provide third parties with a SAS token to access a storage account or blob, then rotating the account keys will cause the SAS token to be invalidated. There are some ways to avoid this, including generating SAS tokens from your deployment process and sending them out from there, or by using stored access policies; these approaches are beyond the scope of this post.

The next sections provide some detail on the important steps in the list above.

Step 1: Choosing a Key

The first step we need to perform is to decide whether we should use the primary or secondary keys for this deployment. Ideally each deployment would switch between them - so deployment 1 would use the primary keys, deployment 2 the secondary, deployment 3 the primary, deployment 4 the secondary, etc. This requires that we store some state about the deployments somewhere. Don’t forget, though, that the very first time we deploy the application we won’t have this state set. We need to allow for this scenario too.

The option that I’ve chosen to use in the sample is to use a resource group tag. Azure lets us use tags to attach custom metadata to most resource types, as well as to resource groups. I’ve used a custom tag named CurrentKeys to indicate whether the resources in that group currently use the primary or secondary keys.

There are other places you could store this state too - some sort of external configuration system, or within your release management tool. You could even have your deployment scripts look at the keys currently used by the application code, compare them to the keys on the actual target resources, and then infer which key set is being used that way.

A simpler alternative to maintaining state is to randomly choose to use the primary or secondary keys on every deployment. This may sometimes mean that you end up reusing the same keys repeatedly for several deployments in a row, but in many cases this might not be a problem, and may be worth the simplicity of not maintaining state.
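
Both options can be sketched in a few lines of Python. The tag values `primary` and `secondary` are my assumption about what the CurrentKeys tag might contain, and reading the tag from the resource group is left out; the function simply receives the tag value, or `None` on the very first deployment.

```python
import random
from typing import Optional


def next_key_set(current_keys_tag: Optional[str]) -> str:
    """Alternate between key sets based on the previous deployment's tag value."""
    if current_keys_tag is None:
        # First deployment: there is no state yet, so start with the primary keys.
        return "primary"
    tag = current_keys_tag.strip().lower()
    if tag == "primary":
        return "secondary"
    if tag == "secondary":
        return "primary"
    raise ValueError(f"Unexpected CurrentKeys tag value: {current_keys_tag!r}")


def random_key_set(rng=random) -> str:
    """Stateless alternative: pick a key set at random on each deployment."""
    return rng.choice(["primary", "secondary"])
```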

Step 2: Deploy the ARM Template

Our ARM template includes the resource definitions for all of the components we want to create - a storage account, a Cosmos DB account, a Service Bus namespace, and an Azure Function app to use for testing. You can see the full ARM template here.

Note that we are deploying the Azure Function application code using the ARM template deployment method.

Additionally, we copy the keys for our services into the Azure Function app's settings, and into a Key Vault, so that we can access them from our application.

Step 4: Testing the Keys

Once we've finished deploying the ARM template and completing any other deployment steps, we should test to make sure that the keys we're trying to use are valid. Many deployments include some sort of smoke test - a quick test of core functionality of the application. In this case, I wrote an Azure Function that will check that it can connect to the Azure resources in question.

Testing Azure Storage Keys

To test connectivity to Azure Storage, we run a query against the storage API to check if a blob container exists. We don't actually care if the container exists or not; we just check to see if we can successfully make the request:

Testing Cosmos DB Keys

To test connectivity to Cosmos DB, we use the Cosmos DB SDK to try to retrieve some metadata about the database account. Once again we're not interested in the results, just in the success of the API call:

Testing Service Bus Keys

And finally, to test connectivity to Service Bus, we try to get a list of queues within the Service Bus namespace. As long as we get something back, we consider the test to have passed:

You can view the full Azure Function here.
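
The pattern shared by all three checks is "make a request and only care whether it succeeded". A minimal Python sketch of that pattern follows; the probes are hypothetical callables standing in for the real SDK calls, and this is not the code of the Azure Function itself.

```python
from typing import Callable, Dict


def run_connectivity_checks(probes: Dict[str, Callable[[], object]]) -> Dict[str, str]:
    """Run each probe; any return value counts as success, any exception as failure."""
    results = {}
    for name, probe in probes.items():
        try:
            probe()  # The result is deliberately ignored.
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results
```

A probe that returns `False` (for example, "the container doesn't exist") still passes, because the request itself was accepted with the key we supplied.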

Step 6: Rotating the Keys

One of the last steps we perform is to actually rotate the keys for the services. The way in which we request key rotations is different depending on the services we're talking to.

Rotating Azure Storage Keys

Azure Storage provides an API that can be used to regenerate an account key. From PowerShell we can use the New-AzureRmStorageAccountKey cmdlet to access this API:

Rotating Cosmos DB Keys

For Cosmos DB, there is a similar API to regenerate an account key. There are no first-party PowerShell cmdlets for Cosmos DB, so we can instead use a generic Azure Resource Manager cmdlet to invoke the API:

Rotating Service Bus Keys

Service Bus provides an API to regenerate the keys for a specified authorization rule. For this example we're using the default RootManageSharedAccessKey authorization rule, which is created automatically when the Service Bus namespace is provisioned. The PowerShell cmdlet New-AzureRmServiceBusKey can be used to access this API:

You can see the full script here.
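
Each service names its two keys differently, so it can help to keep the mapping in one place. The Python sketch below works out which key name to ask each service to regenerate, given the key set the application is currently using. The key names reflect my understanding of the three regenerate-key APIs, and the service labels (`storage`, `cosmosdb`, `servicebus`) are just illustrative dictionary keys.

```python
# Key names as each regenerate-key API expects them (to my understanding).
KEY_NAMES = {
    "storage": {"primary": "key1", "secondary": "key2"},
    "cosmosdb": {"primary": "primary", "secondary": "secondary"},
    "servicebus": {"primary": "PrimaryKey", "secondary": "SecondaryKey"},
}


def keys_to_rotate(key_set_in_use: str) -> dict:
    """Return, per service, the name of the key that is safe to regenerate."""
    if key_set_in_use not in ("primary", "secondary"):
        raise ValueError(f"Unknown key set: {key_set_in_use!r}")
    unused = "secondary" if key_set_in_use == "primary" else "primary"
    return {service: names[unused] for service, names in KEY_NAMES.items()}
```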


Key management and rotation is often a painful process, but if your application deployments are completely self-contained then the process described here is one way to ensure that you continuously keep your keys changing and up-to-date.

You can download the full set of scripts and code for this example from GitHub.

Deploying Azure Functions with ARM Templates

This post was originally published on the Kloud blog.

There are many different ways in which an Azure Function can be deployed. In a future blog post I plan to go through the whole list. There is one deployment method that isn't commonly known though, and it's of particular interest to those of us who use ARM templates to deploy our Azure infrastructure. Before I describe it, I'll quickly recap ARM templates.

ARM Templates

Azure Resource Manager (ARM) templates are JSON files that describe the state of a resource group. They typically declare the full set of resources that need to be provisioned or updated. ARM templates are idempotent, so a common pattern is to run the template deployment regularly—often as part of a continuous deployment process—which will ensure that the resource group stays in sync with the description within the template.

In general, the role of ARM templates is typically to deploy the infrastructure required for an application, while the deployment of the actual application logic happens separately. However, Azure Functions' ARM integration has a feature whereby an ARM template can be used to deploy the files required to make the function run.

How to Deploy Functions in an ARM Template

In order to deploy a function through an ARM template, we need to declare a resource of type Microsoft.Web/sites/functions, like this:

There are two important parts to this.

First, the config property is essentially the contents of the function.json file. It includes the list of bindings for the function, and in the example above it also includes the disabled property.

Second, the files property is an object that contains key-value pairs representing each file to deploy. The key represents the filename, and the value represents the full contents of the file. This only really works for text files, so this deployment method is probably not the right choice for precompiled functions and other binary files. Also, the file needs to be inlined within the template, which may quickly get unwieldy for larger function files—and even for smaller files, the file needs to be escaped as a JSON string. This can be done using an online tool like this, or you could use a script to do the escaping and pass the file contents as a parameter into the template deployment.
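
To make the shape of the resource and the escaping concrete, here is a minimal Python sketch that builds such a resource as a dictionary and lets `json.dumps` take care of escaping the file contents. The API version, app name, function name, and binding shown are assumptions for illustration only.

```python
import json


def function_resource(app_name, function_name, bindings, files, disabled=False):
    """Build a Microsoft.Web/sites/functions resource as a plain dictionary."""
    return {
        "type": "Microsoft.Web/sites/functions",
        "apiVersion": "2015-08-01",  # assumption: use whichever version your template targets
        "name": f"{app_name}/{function_name}",
        "properties": {
            # Equivalent to the contents of function.json.
            "config": {"bindings": bindings, "disabled": disabled},
            # Filename -> full file contents, as text.
            "files": files,
        },
    }


if __name__ == "__main__":
    source = 'public static void Run(TimerInfo myTimer, TraceWriter log)\n{\n    log.Info("Hello");\n}\n'
    timer = {"name": "myTimer", "type": "timerTrigger", "direction": "in", "schedule": "0 0 * * * *"}
    resource = function_resource("my-function-app", "MyFunction", [timer], {"run.csx": source})
    # json.dumps escapes the newlines and quotes in run.csx for us.
    print(json.dumps(resource, indent=2))
```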

Importantly, in my testing I found that using this method to deploy over an existing function will remove any files that are not declared in the files list, so be careful when testing this approach if you've modified the function or added any files through the portal or elsewhere.


There are many different ways you can insert your function file into the template, but one of the ways I tend to use is a PowerShell script. Inside the script, we can read the contents of the file into a string, and create a HashTable for the ARM template deployment parameters:

Then we can use the New-AzureRmResourceGroupDeployment cmdlet to execute the deployment, passing in $templateParameters to the -TemplateParameterObject argument.

You can see the full example here.

Of course, if you have a function that doesn't change often then you could instead manually convert the file into a JSON-encoded string using a tool like this one, and paste the function right into the ARM template. To see a full example of how this can be used, check out this example ARM template from a previous blog article I wrote.

When to Use It

Deploying a function through an ARM template can make sense when you have a very simple function that consists of one, or just a few, files to be deployed. In particular, if you already deploy the function app itself through the ARM template then this might be a natural extension of what you're doing.

This type of deployment can also make sense if you're wanting to quickly deploy and test a function and don't need some of the more complex deployment-related features like control over handling locked files. It's also a useful technique to have available for situations where a full deployment script might be too heavyweight.

However, for precompiled functions, functions that have binary files, and for complex deployments, it's probably better to use another deployment mechanism. Nevertheless, I think it's useful to know that this is a tool in your Azure Functions toolbox.

Deploying Blob Containers with ARM Templates

This post was originally published on the Kloud blog.

ARM templates are a great way to programmatically deploy your Azure resources. They act as declarative descriptions of the desired state of an Azure resource group, and while they can be frustrating to work with, overall the ability to use templates to deploy your Azure resources provides a lot of value.

One common frustration with ARM templates is that certain resource types simply can't be deployed with them. Until recently, one such resource type was a blob container. ARM templates could deploy Azure Storage accounts, but not blob containers, queues, or tables within them.

That has now changed, and it's possible to deploy a blob container through an ARM template. Here's an example template that deploys a container called logs within a storage account:
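
As a rough sketch of what such a template might contain, the Python below builds one as a dictionary and prints it as JSON. The nested resource type, the `default` blob service segment, the API version, and the storage account settings are my assumptions about the ARM schema, and `storageAccountName` is a hypothetical parameter name.

```python
import json


def blob_container_template(container_name: str = "logs") -> dict:
    """Build an ARM template with a storage account and one blob container."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {"storageAccountName": {"type": "string"}},
        "resources": [
            {
                "type": "Microsoft.Storage/storageAccounts",
                "apiVersion": "2018-02-01",  # assumption
                "name": "[parameters('storageAccountName')]",
                "location": "[resourceGroup().location]",
                "kind": "StorageV2",
                "sku": {"name": "Standard_LRS"},
                "properties": {"accessTier": "Hot"},
                "resources": [
                    {
                        # Nested under the account, so the type and name are relative.
                        "type": "blobServices/containers",
                        "apiVersion": "2018-02-01",  # assumption
                        "name": f"default/{container_name}",
                        "dependsOn": ["[parameters('storageAccountName')]"],
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(blob_container_template(), indent=2))
```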

Queues and tables still can't be deployed this way, though - hopefully that's coming soon.

Avoiding Cosmos DB Bill Shock with Azure Functions

This post was originally published on the Kloud blog.

Cosmos DB is a fantastic database service for many different types of applications. But it can also be quite expensive, especially if you have a number of instances of your database to maintain. For example, in some enterprise development teams you may need to have dev, test, UAT, staging, and production instances of your application and its components. Assuming you're following best practices and keeping these isolated from each other, that means you're running at least five Cosmos DB collections. It's easy for someone to accidentally leave one of these Cosmos DB instances provisioned at a higher throughput than you expect, and before long you're racking up large bills, especially if the higher throughput is left overnight or over a weekend.

In this post I'll describe an approach I've been using recently to ensure the Cosmos DB collections in my subscriptions aren't causing costs to escalate. I've created an Azure Function that will run on a regular basis. It uses a managed service identity to identify the Cosmos DB accounts throughout my whole Azure subscription, and then it looks at each collection in each account to check that they are set at the expected throughput. If it finds anything over-provisioned, it sends an email so that I can investigate what's happening. You can run the same function to help you identify over-provisioned collections too.

Step 1: Create Function App

First, we need to set up an Azure Functions app. You can do this in many different ways; for simplicity, we'll use the Azure Portal for everything here.

Click Create a Resource on the left pane of the portal, and then choose Serverless Function App. Enter the information it prompts for - a globally unique function app name, a subscription, a region, and a resource group - and click Create.


Step 2: Enable a Managed Service Identity

Once we have our function app ready, we need to give it a managed service identity. This will allow us to connect to our Azure subscription and list the Cosmos DB accounts within it, but without us having to maintain any keys or secrets. For more information on managed service identities, check out my previous post.

Open up the Function Apps blade in the portal, open your app, and click Platform Features, then Managed service identity:


Switch the feature to On and click Save.

Step 3: Create Authorisation Rules

Now we have an identity for our function, we need to grant it access to the parts of our Azure subscription we want it to examine for us. In my case I'll grant it the rights over my whole subscription, but you could just give it rights on a single resource group, or even just a single Cosmos DB account. Equally you can give it access across multiple subscriptions and it will look through them all.

Open up the Subscriptions blade and choose the subscription you want it to look over. Click Access Control (IAM):


Click the Add button to create a new role assignment.

The minimum role we need to grant the function app is called Cosmos DB Account Reader Role. This allows the function to discover the Cosmos DB accounts, and to retrieve the read-only keys for those accounts, as described here. The function app can't use this role to make any changes to the accounts.

Finally, enter the name of your function app, click it, and click Save:

This will create the role assignment. Your function app is now authorised to enumerate and access Cosmos DB accounts throughout the subscription.

Step 4: Add the Function

Next, we can actually create our function. Go back into the function app and click the + button next to Functions. We'll choose to create a custom function:

Then choose a timer trigger:

Choose C# for the language, and enter the name CosmosChecker. (Feel free to use a name with more panache if you want.) Leave the timer settings alone for now:


Your function will open up with some placeholder code. We'll ignore this for now. Click the View files button on the right side of the page, and then click the Add button. Create a file named project.json, and then open it and paste in the following, then click Save:

This will add the necessary package references that we need to find and access our Cosmos DB collections, and then to send alert emails using SendGrid.

Now click on the run.csx file and paste in the following file:

I won't go through the entire script here, but I have added comments to try to make its purpose a little clearer.

Finally, click on the function.json file and replace the contents with the following:

This will configure the function app with the necessary timer, as well as an output binding to send an email. We'll discuss most of these settings later, but one important setting to note is the schedule setting. The value I've got above means the function will run every hour. You can change it to other values using cron expressions, such as:

  • Run every day at 9.30am UTC: 0 30 9 * * *

  • Run every four hours: 0 0 */4 * * *

  • Run once a week (at midnight UTC on Sundays): 0 0 0 * * 0

You can decide how frequently you want this to run and replace the schedule with the appropriate value from above.
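
Timer schedules in Azure Functions are written with six fields, with seconds first, which is easy to misread if you're used to five-field cron. The small Python helper below is only a reading aid that labels each field; it handles the six-field form only and does not validate or evaluate the expression.

```python
FIELD_NAMES = ("second", "minute", "hour", "day", "month", "day_of_week")


def describe_schedule(expression: str) -> dict:
    """Label the six fields of a timer schedule expression."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"Expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


if __name__ == "__main__":
    for expression in ("0 30 9 * * *", "0 0 */4 * * *"):
        print(expression, "->", describe_schedule(expression))
```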

Step 5: Get a SendGrid Account

We're using SendGrid to send email alerts. SendGrid has built-in integration with Azure Functions so it's a good choice, although you're obviously welcome to switch out for anything else if you'd prefer. You might want an SMS message to be sent via Twilio, or a message to be sent to Slack via the Slack webhook API, for example.

If you don't already have a SendGrid account you can sign up for a free account on their website. Once you've got your account, you'll need to create an API key and have it ready for the next step.

Step 6: Configure Function App Settings

Click on your function app name and then click on Application settings:


Scroll down to the Application settings section. We'll need to enter three settings here:

  1. Setting name: SendGridKey. This should have a value of your SendGrid API key from step 5.

  2. Setting name: AlertToAddress. This should be the email address that you want alerts to be sent to.

  3. Setting name: AlertFromAddress. This should be the email address that you want alerts to be sent from. This can be the same as the 'to' address if you want.

Your Application settings section should look something like this:


Step 7: Run the Function

Now we can run the function! Click on the function name again (CosmosChecker), and then click the Run button. You can expand out the Logs pane at the bottom of the screen if you want to watch it run:

Depending on how many Cosmos DB accounts and collections you have, it may take a minute or two to complete.

If you've got any collections provisioned over 2000 RU/s, you should receive an email telling you this fact:

Configuring Alert Policies

By default, the function is configured to alert whenever it sees a Cosmos DB collection provisioned over 2000 RU/s. However, your situation may be quite different to mine. For example, you may want to be alerted whenever you have any collections provisioned over 1000 RU/s. Or, you may have production applications that should be provisioned up to 100,000 RU/s, but you only want development and test collections provisioned at 2000 RU/s.

You can configure alert policies in two ways.

First, if you have a specific collection that should have a specific policy applied to it - like the production collection I mentioned that should be allowed to go to 100,000 RU/s - then you can create another application setting. Give it a name that starts with MaximumThroughput: followed by the account name, database name, and collection name, separated by colons, and set the value to the limit you want for that collection.

For example, a collection named customers in a database named customerdb in an account named myaccount-prod would have a setting named MaximumThroughput:myaccount-prod:customerdb:customers. The value would be 100000, assuming you wanted the function to check this collection against a limit of 100,000 RU/s.

Second, the function has a default quota of 2000 RU/s. You can adjust this to whatever value you want by altering the value on line 17 of the function code file (run.csx).
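
The way the two kinds of policy combine can be sketched in Python as follows. This is an illustration of the lookup rule rather than the C# in run.csx; the `settings` dictionary stands in for the function app's application settings, and the tuple layout for collections is hypothetical.

```python
DEFAULT_MAX_THROUGHPUT = 2000  # RU/s


def max_throughput(settings, account, database, collection, default=DEFAULT_MAX_THROUGHPUT):
    """Use the collection-specific setting if there is one, otherwise the default."""
    key = f"MaximumThroughput:{account}:{database}:{collection}"
    value = settings.get(key)
    return int(value) if value is not None else default


def over_provisioned(collections, settings):
    """Return the (account, database, collection, throughput) tuples over their limit."""
    return [c for c in collections if c[3] > max_throughput(settings, c[0], c[1], c[2])]
```

With a setting of `MaximumThroughput:myaccount-prod:customerdb:customers` set to `100000`, that collection is only reported above 100,000 RU/s, while every other collection is still checked against 2000 RU/s.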

ARM Template

If you want to deploy this function for yourself, you can also use an ARM template I have prepared. This performs all the steps listed above except step 3, which you still need to do manually.

Of course, you are also welcome to adjust the actual logic involved in checking the accounts and collections to suit your own needs. The full code is available on GitHub and you are welcome to take and modify it as much as you like! I hope this helps to avoid some nasty bill shocks.

Demystifying Managed Service Identities on Azure

This post was originally published on the Kloud blog.

Managed service identities (MSIs) are a great feature of Azure that are being gradually enabled on a number of different resource types. But when I'm talking to developers, operations engineers, and other Azure customers, I often find that there is some confusion and uncertainty about what they do. In this post I will explain what MSIs are and are not, where they make sense to use, and give some general advice on how to work with them.

What Do Managed Service Identities Do?

A managed service identity allows an Azure resource to identify itself to Azure Active Directory without needing to present any explicit credentials. Let's explain that a little more.

In many situations, you may have Azure resources that need to securely communicate with other resources. For example, you may have an application running on Azure App Service that needs to retrieve some secrets from a Key Vault. Before MSIs existed, you would need to create an identity for the application in Azure AD, set up credentials for that application (also known as creating a service principal), configure the application to know these credentials, and then communicate with Azure AD to exchange the credentials for a short-lived token that Key Vault will accept. This requires quite a lot of upfront setup, and can be difficult to achieve within a fully automated deployment pipeline. Additionally, to maintain a high level of security, the credentials should be changed (rotated) regularly, and this requires even more manual effort.

With an MSI, in contrast, the App Service automatically gets its own identity in Azure AD, and there is a built-in way that the app can use its identity to retrieve a token. We don't need to maintain any AD applications, create any credentials, or handle the rotation of these credentials ourselves. Azure takes care of it for us.

It can do this because Azure can identify the resource - it already knows where a given App Service or virtual machine 'lives' inside the Azure environment, so it can use this information to allow the application to identify itself to Azure AD without the need for exchanging credentials.

What Do Managed Service Identities Not Do?

Inbound requests: One of the biggest points of confusion about MSIs is whether they are used for inbound requests to the resource or for outbound requests from the resource. MSIs are for the latter - when a resource needs to make an outbound request, it can identify itself with an MSI and pass its identity along to the resource it's requesting access to.

MSIs pair nicely with other features of Azure resources that allow for Azure AD tokens to be used for their own inbound requests. For example, Azure Key Vault accepts requests with an Azure AD token attached, and it evaluates which parts of Key Vault can be accessed based on the identity of the caller. An MSI can be used in conjunction with this feature to allow an Azure resource to directly access a Key Vault-managed secret.

Authorization: Another important point is that MSIs are only directly involved in authentication, and not in authorization. In other words, an MSI allows Azure AD to determine what the resource or application is, but that by itself says nothing about what the resource can do. Authorization is handled separately, by whichever access control system governs the target resource. For some Azure resources this is Azure's own Identity and Access Management system (IAM). Key Vault is one exception - it maintains its own access control system, and is managed outside of Azure's IAM. For non-Azure resources, we could communicate with any authorisation system that understands Azure AD tokens; an MSI will then just be another way of getting a valid token that an authorisation system can accept.

Another important point to be aware of is that the target resource doesn't need to run within the same Azure subscription, or even within Azure at all. Any service that understands Azure Active Directory tokens should work with tokens for MSIs.

How to Use MSIs

Now that we know what MSIs can do, let's have a look at how to use them. Generally there will be three main parts to working with an MSI: enabling the MSI; granting it rights to a target resource; and using it.

  1. Enabling an MSI on a resource. Before a resource can identify itself to Azure AD, it needs to be configured to expose an MSI. The way that you do this will depend on the specific resource type you're enabling the MSI on. In App Services, an MSI can be enabled through the Azure Portal, through an ARM template, or through the Azure CLI, as documented here. For virtual machines, an MSI can be enabled through the Azure Portal or through an ARM template. Other MSI-enabled services have their own ways of doing this.

  2. Granting rights to the target resource. Once the resource has an MSI enabled, we can grant it rights to do something. The way that we do this is different depending on the type of target resource. For example, Key Vault requires that you configure its Access Policies, while to use the Event Hubs or the Azure Resource Manager APIs you need to use Azure's IAM system. Other target resource types will have their own way of handling access control.

  3. Using the MSI to issue tokens. Finally, now that the resource's MSI is enabled and has been granted rights to a target resource, it can be used to actually issue tokens so that a target resource request can be issued. Once again, the approach will be different depending on the resource type. For App Services, there is an HTTP endpoint within the App Service's private environment that can be used to get a token, and there is also a .NET library that will handle the API calls if you're using a supported platform. For virtual machines, there is also an HTTP endpoint that can similarly be used to obtain a token. Of course, you don't need to specify any credentials when you call these endpoints - they're only available within that App Service or virtual machine, and Azure handles all of the credentials for you.
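
To give a feel for the third step on a virtual machine, here is a minimal Python sketch using only the standard library. The endpoint address, API version, and `Metadata` header reflect my understanding of the VM's instance metadata service and should be checked against the current documentation; App Services use a different endpoint that isn't shown here.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the VM-local token endpoint; only reachable from inside the VM.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource: str, api_version: str = "2018-02-01") -> urllib.request.Request:
    """Build the request for a token that the given target resource will accept."""
    query = urllib.parse.urlencode({"api-version": api_version, "resource": resource})
    # No credentials are sent; the header just marks this as a metadata request.
    return urllib.request.Request(f"{IMDS_TOKEN_URL}?{query}", headers={"Metadata": "true"})


def get_access_token(resource: str) -> str:
    """Call the endpoint and return the bearer token from the JSON response."""
    with urllib.request.urlopen(build_token_request(resource), timeout=5) as response:
        return json.load(response)["access_token"]
```

Calling `get_access_token("https://vault.azure.net")` from inside an MSI-enabled VM would, under these assumptions, return a token that could be sent to Key Vault in an `Authorization: Bearer` header.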

Finding an MSI's Details and Listing MSIs

There may be situations where we need to find our MSI's details, such as the principal ID used to represent the application in Azure AD. For example, we may need to manually configure an external service to authorise our application to access it. As of April 2018, the Azure Portal shows MSIs when adding role assignments, but the Azure AD blade doesn't seem to provide any way to view a list of MSIs. They are effectively hidden from the list of Azure AD applications. However, there are a couple of other ways we can find an MSI.

If we want to find a specific resource's MSI details then we can go to the Azure Resource Explorer and find our resource. The JSON details for the resource will generally include an identity property, which in turn includes a principalId:


That principalId is the object ID of the MSI's service principal in Azure AD, and can be used for role assignments.
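
If you export the resource's JSON, for example by copying it from the Resource Explorer, pulling out the principalId is straightforward. The Python sketch below assumes the identity block has the shape described above; the sample values are placeholders.

```python
import json
from typing import Optional


def principal_id(resource_json: str) -> Optional[str]:
    """Return identity.principalId from a resource's JSON, or None if no MSI is enabled."""
    resource = json.loads(resource_json)
    identity = resource.get("identity") or {}
    return identity.get("principalId")


if __name__ == "__main__":
    sample = """
    {
      "name": "my-function-app",
      "identity": {
        "type": "SystemAssigned",
        "tenantId": "<tenant-id>",
        "principalId": "<principal-id>"
      }
    }
    """
    print(principal_id(sample))
```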

Another way to find and list MSIs is to use the Azure AD PowerShell cmdlets. The Get-AzureRmADServicePrincipal cmdlet will return a complete list of service principals in your Azure AD directory, including any MSIs. MSIs have service principal names starting with, and the ApplicationId is the client ID of the service principal:


Now that we've seen how to work with an MSI, let's look at which Azure resources actually support creating and using them.

Resource Types with MSI and AAD Support

As of April 2018, there are only a small number of Azure services with support for creating MSIs, and all of them are currently in preview. Additionally, while it's not yet listed on that page, Azure API Management also supports MSIs - this is primarily for handling Key Vault integration for SSL certificates.

One important note is that for App Services, MSIs are currently incompatible with deployment slots - only the production slot gets assigned an MSI. Hopefully this will be resolved before MSIs become fully available and supported.

As I mentioned above, MSIs are really just a feature that allows a resource to assume an identity that Azure AD will accept. However, in order to actually use MSIs within Azure, it's also helpful to look at which resource types support receiving requests with Azure AD authentication, and therefore support receiving MSIs on incoming requests. Microsoft maintain a list of these resource types here.

Example Scenarios

Now that we understand what MSIs are and how they can be used with AAD-enabled services, let's look at a few example real-world scenarios where they can be used.

Virtual Machines and Key Vault

Azure Key Vault is a secure data store for secrets, keys, and certificates. Key Vault requires that every request is authenticated with Azure AD. As an example of how this might be used with an MSI, imagine we have an application running on a virtual machine that needs to retrieve a database connection string from Key Vault. Once the VM is configured with an MSI and the MSI is granted Key Vault access rights, the application can request a token and can then get the connection string without needing to maintain any credentials to access Key Vault.

API Management and Key Vault

Another great example of an MSI being used with Key Vault is Azure API Management. API Management creates a public domain name for the API gateway, to which we can assign a custom domain name and SSL certificate. We can store the SSL certificate inside Key Vault, and then give Azure API Management an MSI and access to that Key Vault secret. Once it has this, API Management can automatically retrieve the SSL certificate for the custom domain name straight from Key Vault, simplifying the certificate installation process and improving security by ensuring that the certificate is not directly passed around.

Azure Functions and Azure Resource Manager

Azure Resource Manager (ARM) is the deployment and resource management system used by Azure. ARM itself supports AAD authentication. Imagine we have an Azure Function that needs to scan our Azure subscription to find resources that have recently been created. In order to do this, the function needs to log into ARM and get a list of resources. Our Azure Functions app can expose an MSI, and so once that MSI has been granted reader rights on the resource group, the function can get a token to make ARM requests and get the list without needing to maintain any credentials.

App Services and Event Hubs/Service Bus

Event Hubs is a managed event stream. Communication to both publish onto, and subscribe to events from, the stream can be secured using Azure AD. An example scenario where MSIs would help here is when an application running on Azure App Service needs to publish events to an Event Hub. Once the App Service has been configured with an MSI, and Event Hubs has been configured to grant that MSI publishing permissions, the application can retrieve an Azure AD token and use it to post messages without having to maintain keys.

Service Bus provides a number of features related to messaging and queuing, including queues and topics (similar to queues but with multiple subscribers). As with Event Hubs, an application could use its MSI to post messages to a queue or to read messages from a topic subscription, without having to maintain keys.

App Services and Azure SQL

Azure SQL is a managed relational database, and it supports Azure AD authentication for incoming connections. A database can be configured to allow Azure AD users and applications to read or write specific types of data, to execute stored procedures, and to manage the database itself. When coupled with an App Service with an MSI, Azure SQL's AAD support is very powerful - it reduces the need to provision and manage database credentials, and ensures that only a given application can log into a database with a given user account. Tomas Restrepo has written a great blog post explaining how to use Azure SQL with App Services and MSIs.


In this post we've looked into the details of managed service identities (MSIs) in Azure. MSIs provide some great security and management benefits for applications and systems hosted on Azure, and enable high levels of automation in our deployments. While they aren't particularly complicated to understand, there are a few subtleties to be aware of. As long as you understand that MSIs are for authentication of a resource making an outbound request, and that authorisation is a separate thing that needs to be managed independently, you will be able to take advantage of MSIs with the services that already support them, as well as the services that may soon get MSI and AAD support.

Cosmos DB Server-Side Programming with TypeScript - Part 6: Build and Deployment

This post was originally published on the Kloud blog.

So far in this series we've been compiling our server-side TypeScript code to JavaScript locally on our own machines, and then copying and pasting it into the Azure Portal. However, an important part of building a modern application - especially a cloud-based one - is having a reliable automated build and deployment process. There are a number of reasons why this is important, ranging from avoiding builds on a developer's own machine - where environmental variations or differences may cause different outputs - through to running a suite of tests on every build and release. In this post we will look at how Cosmos DB server-side code can be built and released in a fully automated process.

This post is part of a series:

  • Part 1 gives an overview of the server side programmability model, the reasons why you might want to consider server-side code in Cosmos DB, and some key things to watch out for.

  • Part 2 deals with user-defined functions, the simplest type of server-side programming, which allow for adding simple computation to queries.

  • Part 3 talks about stored procedures. These provide a lot of powerful features for creating, modifying, deleting, and querying across documents - including in a transactional way.

  • Part 4 introduces triggers. Triggers come in two types - pre-triggers and post-triggers - and allow for behaviour like validating and modifying documents as they are inserted or updated, and creating secondary effects as a result of changes to documents in a collection.

  • Part 5 discusses unit testing your server-side scripts. Unit testing is a key part of building a production-grade application, and even though some of your code runs inside Cosmos DB, your business logic can still be tested.

  • Finally, part 6 (this post) explains how server-side scripts can be built and deployed into a Cosmos DB collection within an automated build and release pipeline, using Microsoft Visual Studio Team Services (VSTS).

Build and Release Systems

There are a number of services and systems that provide build and release automation. These range from systems you need to install and manage yourself, such as Atlassian Bamboo, Jenkins, and Octopus Deploy, through to managed systems like AWS CodePipeline/CodeBuild, Travis CI, and AppVeyor. In our case, we will use Microsoft's Visual Studio Team Services (VSTS), which is a managed (hosted) service that provides both build and release pipeline features. However, the steps we use here can easily be adapted to other tools.

I will assume that you have a VSTS account, that you have loaded the code into a source code repository that VSTS can access, and that you have some familiarity with the VSTS build and release system.

Throughout this post, we will use the same code that we used in part 5 of this series, where we built and tested our stored procedure. The exact same process can be used for triggers and user-defined functions as well. I'll assume that you have a copy of the code from part 5 - if you want to download it, you can get it from the GitHub repository for that post. If you want to refer to the finished version of the whole project, you can access it on GitHub here.

Defining our Build Process

Before we start configuring anything, let's think about what we want to achieve with our build process. I find it helpful to think about the start point and end point of the build. We know that when we start the build, we will have our code within a Git repository. When we finish, we want to have two things: a build artifact in the form of a JavaScript file that is ready to deploy to Cosmos DB; and a list of unit test results. Additionally, the build should pass if all of the steps ran successfully and the tests passed, and it should fail if any step or any test failed.

Now that we have the start and end points defined, let's think about what we need to do to get us there.

  • We need to install our NPM packages. On VSTS, every time we run a build our build environment will be reset, so we can't rely on any files being there from a previous build. So the first step in our build pipeline will be to run npm install.

  • We need to build our code so it's ready to be tested, and then we need to run the unit tests. In part 5 of this series we created an NPM script to help with this when we run locally - and we can reuse the same script here. So our second build step will be to run npm run test.

  • Once our tests have run, we need to report their results to VSTS so it can visualise them for us. We'll look at how to publish the results below. Importantly, VSTS won't fail the build automatically if there are any test failures, so we'll also look at how to make the build fail ourselves shortly.

  • If we get to this point in the build then our code is successfully passing the tests, so now we can create the real release build. Again we have already defined an NPM script for this, so we can reuse that work and call npm run build.

  • Finally, we can publish the release JavaScript file as a build artifact, which makes it available to our release pipeline.

We'll soon see how we can actually configure this. But before we can write our build process, we need to figure out how we'll report the results of our unit tests back to VSTS.

Reporting Test Results

When we run unit tests from inside a VSTS build, the unit test runner needs some way to report the results back to VSTS. There are some built-in integrations with common tools like VSTest (for testing .NET code). For Jasmine, we need to use a reporter that we configure ourselves. The jasmine-tfs-reporter NPM package does this for us - its reporter will emit a specially formatted results file, and we'll tell VSTS to look at this.
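To illustrate what a Jasmine reporter does in general, here is a minimal sketch of a custom reporter that collects spec results and renders them as XML. This is not the jasmine-tfs-reporter implementation, and the XML shape is purely illustrative; the class and helper names are hypothetical. The only Jasmine behaviour it relies on is that Jasmine calls specDone on each registered reporter, passing a result with fullName and status properties.

```typescript
// The subset of Jasmine's per-spec result object that this sketch reads.
interface SpecResult {
  fullName: string;
  status: string; // for example "passed", "failed", or "pending"
}

// Hypothetical reporter: collects results and renders a simple XML summary.
class XmlSummaryReporter {
  private results: SpecResult[] = [];

  // Jasmine calls specDone on each registered reporter after every spec.
  specDone(result: SpecResult): void {
    this.results.push(result);
  }

  // Illustrative XML only; not the format that jasmine-tfs-reporter writes.
  toXml(): string {
    const failures = this.results.filter(r => r.status === "failed").length;
    const cases = this.results
      .map(r => `  <testcase name="${escapeXml(r.fullName)}" status="${r.status}" />`)
      .join("\n");
    return `<testsuite tests="${this.results.length}" failures="${failures}">\n${cases}\n</testsuite>`;
  }
}

// Escapes the characters that are not allowed inside an XML attribute value.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}
```

A reporter like this would be registered with jasmine.getEnv().addReporter(...), typically from a helper script that Jasmine loads before the specs run. A real reporter would also write its output to a file once the run completes.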

Let's open up our package.json file and add the following line into the devDependencies section:

Run npm install to install the package.

Next, create a file named spec/vstsReporter.ts and add the following lines, which will configure Jasmine to send its results to the reporter we just installed:

Finally, let's edit the jasmine.json file. We'll add a new helpers section, which will tell Jasmine to run that script before it starts running our tests. Here's the new jasmine.json file we'll use:

Now run npm run test. You should see that a new testresults folder has been created, and it contains an XML file that VSTS can understand.

That's the last piece of the puzzle we need to have VSTS build our code. Now let's see how we can make VSTS actually run all of these steps.

Creating the Build Configuration

VSTS has a great feature - currently in preview - that allows us to specify our build definition in a YAML file, check it into our source control system, and have the build system execute it. More information on this feature is available in a previous blog post I wrote. We'll make use of this feature here to write our build process.

Create a new file named build.yaml. This file will define all of our build steps. Paste the following contents into the file:

This YAML file tells VSTS to do the following:

  • Run the npm install command.

  • Run the npm run test command. If we get any test failures, this command will cause VSTS to detect an error.

  • Regardless of whether an error was detected, take the test results that have been saved into the testresults folder and publish them. (Publishing just means showing them within the build; they won't be publicly available.)

  • If everything worked up till now, run npm run build to build the releasable JavaScript file.

  • Publish the releasable JavaScript file as a build artifact, so it's available to the release pipeline that we'll configure shortly.
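The way these steps interact when something fails can be modelled in a few lines of TypeScript. This is only a sketch of the semantics described in the list above, not how VSTS itself is implemented; the Step type and the "succeeded" and "always" condition names belong to this sketch.

```typescript
// A step either requires all earlier steps to have passed, or always runs.
type Condition = "succeeded" | "always";

interface Step {
  name: string;
  condition: Condition;
  run: () => boolean; // returns true on success
}

interface BuildOutcome {
  succeeded: boolean;
  executed: string[];
}

// Runs the steps in order, skipping "succeeded" steps once anything has failed.
function runBuild(steps: Step[]): BuildOutcome {
  let failed = false;
  const executed: string[] = [];
  for (const step of steps) {
    if (failed && step.condition === "succeeded") {
      continue;
    }
    executed.push(step.name);
    if (!step.run()) {
      failed = true;
    }
  }
  return { succeeded: !failed, executed };
}

// A simulated run in which the unit tests fail.
const outcome = runBuild([
  { name: "npm install", condition: "succeeded", run: () => true },
  { name: "npm run test", condition: "succeeded", run: () => false },
  { name: "publish test results", condition: "always", run: () => true },
  { name: "npm run build", condition: "succeeded", run: () => true },
  { name: "publish artifact", condition: "succeeded", run: () => true },
]);
```

In this simulated run the test results are still published, but the release build and artifact steps are skipped and the build as a whole is marked as failed.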

Commit this file and push it to your Git repository. In VSTS, we can now set up a new build configuration, point it to the YAML file, and let it run. After it finishes, you should see something like this:


We can see that four tests ran and passed. If we click on the Artifacts tab, we can view the artifacts that were published:


And by clicking the Explore button and expanding the drop folder, we can see the exact file that was created:


You can even download the file from here, and confirm that it looks like what we expect to be able to send to Cosmos DB. So, now we have our code being built and tested! The next step is to actually deploy it to Cosmos DB.

Deciding on a Release Process

Cosmos DB can be used in many different types of applications, and the way that we deploy our scripts can differ as well. In some applications, like those that are heavily server-based and have initialisation logic, we might provision our database, collections, and scripts through our application code. In other systems, like serverless applications, we want to provision everything we need during our deployment process so that our application can immediately start to work. This means there are several patterns we can adopt for installing our scripts.

Pattern 1: Use Application Initialisation Logic

If we have an Azure App Service, Cloud Service, or another type of application that provides initialisation lifecycle events, we can use the initialisation code to provision our Cosmos DB database and collection, and to install our stored procedures, triggers, and UDFs. The Cosmos DB client SDKs provide a variety of helpful methods to do this. For example, the .NET and .NET Core SDKs provide this functionality. If the platform you are using doesn't have an SDK, you can also use the REST API provided by Cosmos DB.

This approach is also likely to be useful if we dynamically provision databases and collections while our application runs. We can also use this approach if we have an application warm-up sequence where the existence of the collection can be confirmed and any missing pieces can be added.
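As a rough sketch of this 'confirm and add any missing pieces' logic, the code below checks whether a stored procedure exists and creates or replaces it as needed. The ScriptStore interface is a hypothetical abstraction and not the Cosmos DB SDK's API; in a real application it would be implemented using the SDK or the REST API, and the calls would be asynchronous.

```typescript
interface StoredProcedureDefinition {
  id: string;
  body: string;
}

// Hypothetical abstraction over whichever SDK or REST client is in use.
interface ScriptStore {
  read(id: string): StoredProcedureDefinition | undefined;
  create(definition: StoredProcedureDefinition): void;
  replace(definition: StoredProcedureDefinition): void;
}

type EnsureResult = "created" | "replaced" | "unchanged";

// Idempotent: safe to call every time the application warms up.
function ensureStoredProcedure(store: ScriptStore, desired: StoredProcedureDefinition): EnsureResult {
  const existing = store.read(desired.id);
  if (existing === undefined) {
    store.create(desired);
    return "created";
  }
  if (existing.body !== desired.body) {
    store.replace(desired);
    return "replaced";
  }
  return "unchanged";
}

// In-memory stand-in so the sketch runs without a Cosmos DB account.
class InMemoryScriptStore implements ScriptStore {
  private scripts: { [id: string]: StoredProcedureDefinition } = {};

  read(id: string): StoredProcedureDefinition | undefined {
    return this.scripts[id];
  }

  create(definition: StoredProcedureDefinition): void {
    this.scripts[definition.id] = definition;
  }

  replace(definition: StoredProcedureDefinition): void {
    this.scripts[definition.id] = definition;
  }
}
```

Because the function only writes when something is missing or out of date, running it on every start-up costs no more than a read in the common case.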

Pattern 2: Initialise Serverless Applications with a Custom Function

When we're using serverless technologies like Azure Functions or Azure Logic Apps, we may not have the opportunity to initialise our application the first time it loads. We could check the existence of our Cosmos DB resources whenever we are executing our logic, but this is quite wasteful and inefficient. One pattern that can be used is to write a special 'initialisation' function that is called from our release pipeline. This can be used to prepare the necessary Cosmos DB resources, so that by the time our callers execute our code, the necessary resources are already present. However, this presents some challenges, including the fact that it necessitates mixing our deployment logic and code with our main application code.

Pattern 3: Deploying from VSTS

The approach that I will adopt in this post is to deploy the Cosmos DB resources from our release pipeline in VSTS. This means that we will keep our release process separate from our main application code, which gives us the flexibility to use the Cosmos DB resources at any point in our application logic. This may not suit all applications, but for many applications that use Cosmos DB, this type of workflow will work well.

There is a lot more to release configuration than I'll be able to discuss here - that could easily be its own blog series. I'll keep this particular post focused just on installing server-side code onto a collection.

Defining the Release Process

As with builds, it's helpful to think through the process we want the release to follow. Again, we'll think first about the start and end points. When we start the release pipeline, we will have the build that we want to release (which will include our compiled JavaScript script). For now, I'll also assume that you have a resource group containing a Cosmos DB account with an existing database and collection, and that you know the account key. In a future post I will elaborate on how some of this process can also be automated, but this is outside of the scope of this series. Once the release process finishes, we expect that the collection will have the server-side resource installed and ready to use.

VSTS doesn't have built-in support for Cosmos DB. However, we can easily use a custom PowerShell script to install Cosmos DB scripts on our collection. I've written such a script, and it's available for download here. The script uses the Cosmos DB API to deploy stored procedures, triggers, and user-defined functions to a collection.
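For readers curious about what a deployment over the Cosmos DB REST API involves, here is a minimal TypeScript sketch of the request used to create a stored procedure. It is not a translation of the PowerShell script, which isn't reproduced here. It assumes the documented REST conventions for master-key authentication; the helper names and the account name are hypothetical, and the signing step itself (an HMAC-SHA256 over the string below, using the account key) is left out to keep the sketch dependency-free.

```typescript
interface CreateSprocRequest {
  method: string;
  url: string;
  resourceType: string;
  resourceLink: string;
  body: string;
}

// Hypothetical helper: describes the request that creates a stored procedure.
function buildCreateSprocRequest(
  accountName: string,
  databaseName: string,
  collectionName: string,
  sprocName: string,
  scriptBody: string
): CreateSprocRequest {
  // The parent collection is the resource that the authorisation token covers.
  const resourceLink = `dbs/${databaseName}/colls/${collectionName}`;
  return {
    method: "POST",
    url: `https://${accountName}.documents.azure.com/${resourceLink}/sprocs`,
    resourceType: "sprocs",
    resourceLink,
    body: JSON.stringify({ id: sprocName, body: scriptBody }),
  };
}

// The text that gets signed with the account key for master-key authentication.
// The verb, resource type, and date are lower-cased; the resource link is not.
function buildStringToSign(request: CreateSprocRequest, utcDate: string): string {
  return [
    request.method.toLowerCase(),
    request.resourceType.toLowerCase(),
    request.resourceLink,
    utcDate.toLowerCase(),
    "",
    "",
  ].join("\n");
}

// "myaccount" and the script body are placeholders for illustration.
const sprocRequest = buildCreateSprocRequest(
  "myaccount", "Orders", "Orders", "getGroupedOrders", "function getGroupedOrders() {}");
const stringToSign = buildStringToSign(sprocRequest, "Tue, 01 Nov 1994 08:12:31 GMT");
```

The signed value is sent in the Authorization header, alongside x-ms-date and x-ms-version headers. Triggers and user-defined functions follow the same pattern with a different resource type.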

We need to include this script in our build artifacts so that we can use it from our deployment process. So, download the file and save it into a deploy folder in the project's source repository. Now that we have that there, we need to tell the VSTS build process to include it as an artifact, so open the build.yaml file and add this to the end of the file, being careful to align the spaces and indentation with the sections above it:

Commit these changes, and then run a new build.

Now we can set up a release definition in VSTS and link it to our build configuration so it can receive the build artifacts. We only need one step currently, which will deploy our stored procedure using the PowerShell script we included as a build artifact. Of course, a real release process is likely to do a lot more, including deploying your application. For now, though, let's just add a single PowerShell step, and configure it to run an inline script with the following contents:

This inline script does the following:

  • It loads in the PowerShell file from our build artifact, so that the functions within that file are available for us to use.

  • It then runs the DeployStoredProcedure function, which is defined in that PowerShell file. We pass in some parameters so the function can contact Cosmos DB:

    • AccountName - this is the name of your Cosmos DB account.

    • AccountKey - this is the key that VSTS can use to talk to Cosmos DB's API. You can get this from the Azure Portal - open up the Cosmos DB account and click the Keys tab.

    • DatabaseName - this is the name of the database (in our case, Orders).

    • CollectionName - this is the name of the collection (in our case again, Orders).

    • StoredProcedureName - this is the name we want our stored procedure to have in Cosmos DB. This doesn't need to match the name of the function inside our code file, but I recommend it does to keep things clear.

    • SourceFilePath - this is the path to the JavaScript file that contains our script.

Note that in the script above I've assumed that the build configuration's name is CosmosServer-CI, so that appears in the two file paths. If you have a build configuration that uses a different name, you'll need to replace it. Also, I strongly recommend you don't hard-code the account name, account key, database name, and collection name like I've done here - you would instead use VSTS variables and have them dynamically inserted by VSTS. Similarly, the account key should be specified as a secret variable so that it is encrypted. There are also other ways to handle this, including creating the Cosmos DB account and collection within your deployment process, and dynamically retrieving the account key. This is beyond the scope of this series, but in a future blog post I plan to discuss some ways to achieve this.

After configuring our release process, it will look something like this:


Now that we've configured our release process we can create a new release and let it run. If everything has been configured properly, we should see the release complete successfully:


And if we check the collection through the Azure Portal, we can see the stored procedure has been deployed:


This is pretty cool. It means that whenever we commit a change to our stored procedure's TypeScript file, it can automatically be compiled, tested, and deployed to Cosmos DB - without any human intervention. We could now adapt the exact same process to deploy our triggers (using the DeployTrigger function in the PowerShell script) and UDFs (using the DeployUserDefinedFunction function). Additionally, we can easily make our build and deployments into true continuous integration (CI) and continuous deployment (CD) pipelines by setting up automated builds and releases within VSTS.


Over this series of posts, we've explored Cosmos DB's server-side programming capabilities. We've written a number of server-side scripts including a UDF, a stored procedure, and two triggers. We've written them in TypeScript to ensure that we're using strongly typed objects when we interact with Cosmos DB and within our own code. We've also seen how we can unit test our code using Jasmine. Finally, in this post, we've looked at how our server-side scripts can be built and deployed using VSTS and the Cosmos DB API.

I hope you've found this series useful! If you have any questions or similar topics that you'd like to know more about, please post them in the comments below.

Key Takeaways

  • Having an automated build and release pipeline is very important to ensure reliable, consistent, and safe delivery of software. This should include our Cosmos DB server-side scripts.

  • It's relatively easy to adapt the work we've already done with our build scripts to work on a build server. Generally it will simply be a matter of executing npm install and then npm run build to create a releasable build of our code.

  • We can also run our unit tests by simply executing npm run test.

  • Test results from Jasmine can be published into VSTS using the jasmine-tfs-reporter package. Other integrations are available for other build servers too.

  • Deploying our server-side scripts onto Cosmos DB can be handled in different ways for different applications. With many applications, having server-side code deployed within an existing release process is a good idea.

  • VSTS doesn't have built-in support for Cosmos DB, but I have provided a PowerShell script that can be used to install stored procedures, triggers, and UDFs.

  • You can view the code for this post on GitHub.

Cosmos DB Server-Side Programming with TypeScript - Part 5: Unit Testing

This post was originally published on the Kloud blog.

Over the last four parts of this series, we've discussed how we can write server-side code for Cosmos DB, and the types of situations where it makes sense to do so. If you're building a small sample application, you now have enough knowledge to go and build out UDFs, stored procedures, and triggers. But if you're writing production-grade applications, there are two other major topics that need discussion: how to unit test your server-side code, and how to build and deploy it to Cosmos DB in an automated and predictable manner. In this part, we'll discuss testing. In the next part, we'll discuss build and deployment.

This post is part of a series:

  • Part 1 gives an overview of the server side programmability model, the reasons why you might want to consider server-side code in Cosmos DB, and some key things to watch out for.

  • Part 2 deals with user-defined functions, the simplest type of server-side programming, which allow for adding simple computation to queries.

  • Part 3 talks about stored procedures. These provide a lot of powerful features for creating, modifying, deleting, and querying across documents - including in a transactional way.

  • Part 4 introduces triggers. Triggers come in two types - pre-triggers and post-triggers - and allow for behaviour like validating and modifying documents as they are inserted or updated, and creating secondary effects as a result of changes to documents in a collection.

  • Part 5 (this post) discusses unit testing your server-side scripts. Unit testing is a key part of building a production-grade application, and even though some of your code runs inside Cosmos DB, your business logic can still be tested.

  • Finally, part 6 explains how server-side scripts can be built and deployed into a Cosmos DB collection within an automated build and release pipeline, using Microsoft Visual Studio Team Services (VSTS).

Unit Testing Cosmos DB Server-Side Code

Testing JavaScript code can be complex, and there are many different ways to do it and different tools that can be used. In this post I will outline one possible approach for unit testing. There are other ways that we could also test our Cosmos DB server-side code, and your situation may be a bit different to the one I describe here. Some developers and teams place different priorities on some of the aspects of testing, so this isn't a 'one size fits all' approach. In this post, the testing approach we will build out allows for:

  • Mocks: mocking allows us to pass in mocked versions of our dependencies so that we can test how our code behaves independently of a working external system. In the case of Cosmos DB, this is very important: the getContext() method, which we've looked at throughout this series, provides us with access to objects that represent the request, response, and collection. Our code needs to be tested without actually running inside Cosmos DB, so we mock out the objects it sends us.

  • Spies: spies are often a special type of mock. They allow us to inspect the calls that have been made to the object to ensure that we are triggering the methods and side-effects that we expect.

  • Type safety: as in the rest of this series, it's important to use strongly typed objects where possible so that we get the full benefit of the TypeScript compiler's type system.

  • Working within the allowed subset of JavaScript: although Cosmos DB server-side code is built using the JavaScript language, it doesn't provide all of the features of JavaScript. This is particularly important when testing our code, because many test libraries make assumptions about how the code will be run and the level of JavaScript support that will be available. We need to work within the subset of JavaScript that Cosmos DB supports.

I will assume some familiarity with these concepts, but even if they're new to you, you should be able to follow along. Also, please note that this series only deals with unit testing. Integration testing your server-side code is another topic, although it should be relatively straightforward to write integration tests against a Cosmos DB server-side script.

Challenges of Testing Cosmos DB Server-Side Code

Cosmos DB ultimately executes JavaScript code, and so we will use JavaScript testing frameworks to write and run our unit tests. Many of the popular JavaScript and TypeScript testing frameworks and helpers are designed specifically for developers who write browser-based JavaScript or Node.js applications. Cosmos DB has some properties that can make these frameworks difficult to work with.

Specifically, Cosmos DB doesn't support modules. Modules in JavaScript allow for individual JavaScript files to expose a public interface to other blocks of code in different files. When I was preparing for this blog post I spent a lot of time trying to figure out a way to handle the myriad testing and mocking frameworks that assume modules are able to be used in our code. Ultimately I came to the conclusion that it doesn't really matter if we use modules inside our TypeScript files as long as the module code doesn't make it into our release JavaScript files. This means that we'll have to build our code twice - once for testing (which includes the module information we need), and again for release (which doesn't include modules). This isn't uncommon - many development environments have separate 'Debug' and 'Release' build configurations, for example - and we can use some tricks to achieve our goals while still getting the benefit of a good design-time experience.

Defining Our Tests

We'll be working with the stored procedure that we built out in part 3 of this series. The same concepts could be applied to unit testing triggers, and also to user-defined functions (UDFs) - and UDFs are generally easier to test as they don't have any context variables to mock out.

Looking back at the stored procedure, its purpose is to return the list of customers who have ordered any of a specified list of product IDs, grouped by product ID, and so an initial set of test cases might be as follows:

  1. If the productIds parameter is empty, the method should return an empty array.

  2. If the productIds parameter contains one item, it should execute a query against the collection containing the item's identifier as a parameter.

  3. If the productIds parameter contains one item, the method should return a single CustomersGroupedByProduct object in the output array, which should contain the productId that was passed in, and whatever customerIds the mocked collection query returned.

  4. If the method is called with a valid productIds array, and the queryDocuments method on the collection returns false, an error should be returned by the function.

You might have others you want to focus on, and you may want to split some of these out - but for now we'll work with these so we can see how things work. Also, in this post I'll assume that you've got a copy of the stored procedure from part 3 ready to go - if you haven't, you can download it from the GitHub repository for that part. If you want to see the finished version of the whole project, including the tests, you can access it on GitHub here.

Setting up TypeScript Configurations

The first change we'll need to make is to change our TypeScript configuration around a bit. Currently we only have one tsconfig.json file that we use to build. Now we'll need to add a second file. The two files will be used for different situations:

  • tsconfig.json will be the one we use for local builds, and for running unit tests.

  • will be the one we use for creating release builds.

First, open up the tsconfig.json file that we already have in the repository. We need to change it to the following:

The key changes we're making are:

  • We're now including files from the spec folder in our build. This folder will contain the tests that we'll be writing shortly.

  • We've added the line "module": "commonjs". This tells TypeScript that we want to compile our code with module support. Again, this tsconfig.json will only be used when we run our builds locally or for running tests, so we'll later make sure that the module-related code doesn't make its way into our release builds.

  • We've changed from using outFile to outDir, and set the output directory to output/test. When we use modules like we're doing here, we can't use the outFile setting to combine our files together, but this won't matter for our local builds and for testing. We also put the output files into a test subfolder of the output folder so that we keep things organised.

Now we need to create a file with the following contents:

This looks more like the original tsconfig.json file we had, but there are a few minor differences:

  • The include element now looks for files matching the pattern *.ready.ts. We'll look at what this means later.

  • The module setting is explicitly set to none. As we'll see later, this isn't sufficient to get everything we need, but it's good to be explicit here for clarity.

  • The outFile setting - which we can use here because module is set to none - is going to emit a JavaScript file within the build subfolder of the output folder.

Now let's add the testing framework.

Adding a Testing Framework

In this post we'll use Jasmine, a testing framework for JavaScript. We can import it using NPM. Open up the package.json file and replace it with this:

There are a few changes to our previous version:

  • We've now imported the jasmine module, as well as the Jasmine type definitions, into our project; and we've imported moq.ts, a mocking library, which we'll discuss below.

  • We've also added a new test script, which will run a build and then execute Jasmine, passing in a configuration file that we will create shortly.

Run npm install from a command line/terminal to restore the packages, and then create a new file named jasmine.json with the following contents:

We'll understand a little more about this file as we go on, but for now, we just need to understand that this file defines the Jasmine specification files that we'll be testing against. Now let's add our Jasmine test specification so we can see this in action.

Starting Our Test Specification

Let's start by writing a simple test. Create a folder named spec, and within it, create a file named getGroupedOrdersImpl.spec.ts. Add the following code to it:

This code does the following:

  • It sets up a new Jasmine spec named getGroupedOrdersImpl. For clarity, we've given it the same name as the method we're testing, but it doesn't need to match - you could name the spec whatever you want.

  • Within that spec, we have a test case named should return an empty array.

  • That test executes the getGroupedOrdersImpl function, passing in an empty array, and a null object to represent the Collection.

  • Then the test confirms that the result of that function call is an empty array.

This is a fairly simple test - we'll see a slightly more complex one in a moment. For now, though, let's get this running.

There's one step we need to do before we can execute our test. If we tried to run it now, Jasmine would complain that it can't find the getGroupedOrdersImpl method. This is because of the way that JavaScript modules work. Our code needs to export its externally accessible methods so that the Jasmine test can see them. Normally, exporting a module from a Cosmos DB JavaScript file will mean that Cosmos DB doesn't accept the file anymore - we'll see a solution to that shortly.

Open up the src/getGroupedOrders.ts file, and add the following at the very bottom of the file:

The export statement sets up the necessary TypeScript compilation instruction to allow our Jasmine test spec to reach this method.

Now let's run our test. Execute npm run test, which will compile our stored procedure (including the export), compile the test file, and then execute Jasmine. You should see that Jasmine executes the test and shows 1 spec, 0 failures, indicating that our test successfully ran and passed. Now let's add some more sophisticated tests.

Adding Tests with Mocks and Spies

When we're testing code that interacts with external services, we often will want to use mock objects to represent those external dependencies. Most mocking frameworks allow us to specify the behaviour of those mocks, so we can simulate various conditions and types of responses from the external system. Additionally, we can use spies to observe how our code calls the external system.

Jasmine provides a built-in mocking framework, including spy support. However, the Jasmine mocks don't support TypeScript types, and so we lose the benefit of type safety. In my opinion this is an important downside, and so instead we will use the moq.ts mocking framework. You'll see we have already installed it in the package.json.

Since moq.ts is already available to us, we just need to add this line to the top of our spec/getGroupedOrders.spec.ts file:

This tells TypeScript to import the relevant mocking types from the moq.ts module. Now we can use the mocks in our tests.

Let's set up another test, in the same file, as follows:

This test does a little more than the last one:

  • It sets up a mock of the ICollection interface.

  • This mock will send back a hard-coded string (self-link) when the getSelfLink() method is called.

  • It also provides mock behaviour for the queryDocuments method. When the method is called, it invokes the callback function, passing back a list of documents with a single empty string, and then returns true to indicate that the query was accepted.

  • The mock.object() method is used to convert the mock into an instance that can be provided to the getGroupedOrdersImpl function, which then uses that in place of the real Cosmos DB collection. This means we can test out how our code will behave, and we can emulate the behaviour of Cosmos DB as we wish.

  • Finally, we call mock.verify to ensure that the getGroupedOrdersImpl function executed the queryDocuments method on the mock collection exactly once.
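
To make the moving parts above a little more concrete, here is a minimal hand-rolled sketch of the same idea. It deliberately does not use moq.ts: the trimmed ICollection interface, the FakeCollection class, and the countDocuments function are all assumptions made purely for illustration. They only mirror the behaviour described in the bullets - a stubbed self-link, a queryDocuments stub that invokes its callback and returns true, and a call count we can check afterwards.

```typescript
// Hypothetical, trimmed-down collection interface. The real Cosmos DB
// server-side typings have more members and richer signatures.
interface ICollection {
  getSelfLink(): string;
  queryDocuments(
    collectionLink: string,
    query: string,
    callback: (error: Error | undefined, documents: string[]) => void
  ): boolean;
}

// Hand-written fake standing in for a mocking library such as moq.ts.
class FakeCollection implements ICollection {
  public queryDocumentsCalls = 0;

  getSelfLink(): string {
    return "self-link";
  }

  queryDocuments(
    _collectionLink: string,
    _query: string,
    callback: (error: Error | undefined, documents: string[]) => void
  ): boolean {
    this.queryDocumentsCalls++;
    callback(undefined, [""]); // a list containing a single empty string
    return true;               // indicates that the query was accepted
  }
}

// Simplified stand-in for the function under test (an assumption, not the
// stored procedure from part 3).
function countDocuments(collection: ICollection): number {
  let count = 0;
  const accepted = collection.queryDocuments(
    collection.getSelfLink(),
    "SELECT * FROM c",
    (error, documents) => {
      if (error) { throw error; }
      count = documents.length;
    }
  );
  if (!accepted) { throw new Error("The query was not accepted."); }
  return count;
}

const fake = new FakeCollection();
const documentCount = countDocuments(fake);
```

A mocking library gives us the same stubbing and verification without having to write a fake class by hand for each interface, which is the main reason to prefer it as the number of tests grows.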

You can run npm run test again now, and verify that it shows 2 specs, 0 failures, indicating that our new test has successfully passed.

Now let's fill out the rest of the spec file - here's the complete file with all of our test cases included:

You can execute the tests again by calling npm run test. Try tweaking the tests so that they fail, then re-run them and see what happens.

Building and Running

All of the work we've just done means that we can run our tests. However, if we try to build our code to submit to Cosmos DB, it won't work anymore. This is because the export statement we added to make our tests work will emit code that Cosmos DB's JavaScript engine doesn't understand.

We can remove this code at build time by using a preprocessor. This will remove the export statement - or anything else we want to take out - from the TypeScript file. The resulting cleaned file is the one that then gets sent to the TypeScript compiler, and it emits a Cosmos DB-friendly JavaScript file.

To achieve this, we need to chain together a few pieces. First, let's open up the src/getGroupedOrders.ts file. Replace the line that says export with this section:

The extra lines we've added are preprocessor directives. TypeScript itself doesn't understand these directives, so we need to use an NPM package to process them. The one I've used here is jspreproc. It will look through the file, handle the directives it finds in specially formatted comments, and then emit the resulting cleaned file. Unfortunately, the preprocessor only works on a single file at a time. This is OK for our situation, as we have all of our stored procedure code in one file, but we might not do that for every situation. Therefore, I have also used the foreach-cli NPM package to search for all of the *.ts files within our src folder and process them. It saves the cleaned files with a .ready.ts extension, which our file refers to.

Open the package.json file and replace it with the following contents:

Now we can run npm install to install all of the packages we're using. You can then run npm run test to run the Jasmine tests, and npm run build to build the releasable JavaScript file. This is emitted into the output/build/sp-getGroupedOrders.js file, and if you inspect that file, you'll see it doesn't have any trace of module exports. It looks just like it did back in part 3, which means we can send it to Cosmos DB without any trouble.


In this post, we've built out the necessary infrastructure to test our Cosmos DB server-side code. We've used Jasmine to run our tests, and moq.ts to mock out the Cosmos DB server objects in a type-safe manner. We also adjusted our build script so that we can compile a clean copy of our stored procedure (or trigger, or UDF) while keeping the necessary export statements to enable our tests to work. In the final post of this series, we'll look at how we can automate the build and deployment of our server-side code using VSTS, and integrate it into a continuous integration and continuous deployment pipeline.

Key Takeaways

  • It's important to test Cosmos DB server-side code. Stored procedures, triggers, and UDFs contain business logic and should be treated as a fully fledged part of our application code, with the same quality criteria we would apply to other types of source code.

  • Because Cosmos DB server-side code is written in JavaScript, it is testable using JavaScript and TypeScript testing frameworks and libraries. However, the lack of support for modules means that we have to be careful in how we use these since they may emit release code that Cosmos DB won't accept.

  • We can use Jasmine for testing. Jasmine also has a mocking framework, but it is not strongly typed.

  • We can get strong typing using a TypeScript mocking library like moq.ts.

  • By structuring our code correctly - using a single entry-point function, which calls out to getContext() and then sends the necessary objects into a function that implements our actual logic - we can easily mock and spy on our calls to the Cosmos DB server-side libraries.

  • We need to export the functions we are testing using the export statement. This makes them available to the Jasmine test spec.

  • However, these export statements need to be removed before we can compile our release version. We can use a preprocessor to remove those statements.

  • You can view the code for this post on GitHub.

Cosmos DB Server-Side Programming with TypeScript - Part 4: Triggers

This post was originally published on the Kloud blog.

Triggers are the third type of server-side code in Cosmos DB. Triggers allow for logic to be run while an operation is running on a document. When a document is to be created, modified, or deleted, our custom logic can be executed - either before or after the operation takes place - allowing us to validate documents, transform documents, and even create secondary documents or perform other operations on the collection. As with stored procedures, this all takes place within the scope of an implicit transaction. In this post, we'll discuss the two types of triggers (pre- and post-triggers), and how we can ensure they are executed when we want them to be. We'll also look at how we can validate, modify, and cause secondary effects from triggers.

This post is part of a series of posts about server-side programming for Cosmos DB:

  • Part 1 gives an overview of the server side programmability model, the reasons why you might want to consider server-side code in Cosmos DB, and some key things to watch out for.

  • Part 2 deals with user-defined functions, the simplest type of server-side programming, which allow for adding simple computation to queries.

  • Part 3 talks about stored procedures. These provide a lot of powerful features for creating, modifying, deleting, and querying across documents - including in a transactional way.

  • Part 4 (this post) introduces triggers. Triggers come in two types - pre-triggers and post-triggers - and allow for behaviour like validating and modifying documents as they are inserted or updated, and creating secondary effects as a result of changes to documents in a collection.

  • Part 5 discusses unit testing your server-side scripts. Unit testing is a key part of building a production-grade application, and even though some of your code runs inside Cosmos DB, your business logic can still be tested.

  • Finally, part 6 explains how server-side scripts can be built and deployed into a Cosmos DB collection within an automated build and release pipeline, using Microsoft Visual Studio Team Services (VSTS).


Triggers let us run custom logic within the scope of an operation already in progress against our collection. This is different to stored procedures, which are explicitly invoked when we want to use them. Triggers give us a lot of power, as they allow us to intercept documents before they are created, modified, or deleted. We can then write our own business logic to perform some custom validation of the operation or document, modify it in some form, or perform another action on the collection instead of, or as well as, the originally requested action.

Similarly to stored procedures, triggers run within an implicit transactional scope - but unlike stored procedures, the transaction isn't just isolated to the actions we perform within our code. It includes the operation that caused the trigger. This means that if we throw an unhandled error inside a trigger, the whole operation - including the original caller's request to insert or modify a document - will be rolled back.

Trigger Behaviour

Triggers have two ways they can be configured: which operations should trigger them and when they should run. We set both of these options at the time when we deploy or update the trigger, using the Trigger Operation and Trigger Type settings respectively. These two settings dictate exactly when a trigger should run.

The trigger operation specifies the types of document operations that should cause the trigger to fire. This can be set to Create (i.e. fire the trigger only when documents are inserted), Delete (i.e. fire the trigger only when documents are deleted), Replace (i.e. fire the trigger only when documents are replaced), or All (fire when any of these operations occur).

The trigger type specifies when the trigger should run with respect to the operation. Pre-triggers run before the operation is performed against the collection, and post-triggers run after the operation is performed. Regardless of when we choose for the trigger to run, the trigger still takes place within the transaction, and the whole transaction can be cancelled by throwing any unhandled error within either type of trigger.
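
The ordering and rollback behaviour described above can be pictured with a toy in-memory model. To be clear, this is only a sketch of the semantics and not how Cosmos DB is implemented; the Store type, the Trigger type, and the createWithTriggers function are all hypothetical names invented for this illustration.

```typescript
// Toy in-memory model of the behaviour described above. This is NOT how
// Cosmos DB is implemented, and every name here is hypothetical.
type Doc = { id: string; [key: string]: unknown };
type Store = { [id: string]: Doc };
type Trigger = (doc: Doc, store: Store) => void;

interface OperationResult {
  committed: boolean;
  store: Store;
  error?: string;
}

function createWithTriggers(
  store: Store,
  doc: Doc,
  preTrigger?: Trigger,
  postTrigger?: Trigger
): OperationResult {
  // Work on a copy so that a failure leaves the original store untouched.
  const working: Store = { ...store };
  try {
    if (preTrigger) { preTrigger(doc, working); }   // before the operation
    working[doc.id] = doc;                          // the operation itself
    if (postTrigger) { postTrigger(doc, working); } // after the operation
    return { committed: true, store: working };
  } catch (error) {
    // Any unhandled error discards all of the work, including the insert.
    const message = error instanceof Error ? error.message : String(error);
    return { committed: false, store, error: message };
  }
}

const rejectEverything: Trigger = () => {
  throw new Error("Rejected by trigger");
};

const emptyStore: Store = {};
const accepted = createWithTriggers(emptyStore, { id: "ORD1" });
const rejected = createWithTriggers(emptyStore, { id: "ORD2" }, undefined, rejectEverything);
```

In this model the second call uses a post-trigger that throws, so the insert that had already happened on the working copy is thrown away - the same outcome we'd expect from an unhandled error in a real post-trigger.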

Note that Cosmos DB also runs a validity check before it starts executing any pre-triggers. This validity check ensures that the operation is able to be performed - for example, a deletion operation can't be performed on a document that doesn't exist, and an insertion operation can't be performed if a document with the specified ID already exists. For this reason, we won't see a pre-trigger fire for invalid requests. The following figure gives an overview of how this flow works, and where triggers can execute our custom code:

If we want to filter to only handle certain sub-types of documents, for example based on their properties, then this can be done within the trigger logic. We can inspect the contents of the document and apply logic selectively.

Working with the Context

Just like in stored procedures, when we write triggers we have access to the context object through the getContext() method. In triggers, we'll typically make use of the Collection object (getContext().getCollection()), and depending on the type of trigger we'll also use the Request object (getContext().getRequest()) and potentially the Response object (getContext().getResponse()).
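
As a rough sketch of how these objects fit together, the snippet below follows the entry-point-plus-implementation pattern used throughout this series. The interfaces are minimal, hand-written approximations rather than the real typings, and getContext() is faked here so that the sketch can run outside Cosmos DB; inside Cosmos DB the runtime supplies it. The describeOperation functions and the sample values are hypothetical.

```typescript
// Minimal, hypothetical shapes for the server-side objects. The real runtime
// exposes more members than are shown here.
interface IRequest {
  getBody(): any;
  setBody(body: any): void;
  getOperationType(): string;
}
interface IResponse {
  getBody(): any;
  setBody(body: any): void;
}
interface ICollection {
  getSelfLink(): string;
}
interface IContext {
  getCollection(): ICollection;
  getRequest(): IRequest;
  getResponse(): IResponse;
}

// Inside Cosmos DB, getContext() is provided by the runtime. It is faked
// here, with made-up values, so that the sketch is self-contained.
let fakeBody: any = { id: "ORD1", type: "order" };
function getContext(): IContext {
  return {
    getCollection: () => ({ getSelfLink: () => "dbs/orders/colls/orders" }),
    getRequest: () => ({
      getBody: () => fakeBody,
      setBody: (body: any) => { fakeBody = body; },
      getOperationType: () => "Create"
    }),
    getResponse: () => { throw new Error("Response is not used in this sketch."); }
  };
}

// Entry point: the only place that touches getContext().
function describeOperation(): string {
  const context = getContext();
  return describeOperationImpl(context.getRequest(), context.getCollection());
}

// Implementation: receives its dependencies, so it is easy to test.
function describeOperationImpl(request: IRequest, collection: ICollection): string {
  const body = request.getBody();
  return `${request.getOperationType()} on ${collection.getSelfLink()} for document ${body.id}`;
}

const summary = describeOperation();
```

Keeping getContext() out of the implementation function is what lets us pass in fakes or mocks later on.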


The Request object gives us the ability to read and modify the original request that caused this trigger to fire. We might choose to inspect the document that the user tried to manipulate, perhaps looking for required fields or imposing our own validation logic. We can also modify the document if we need to. We'll discuss how we do this below.

If we've set up a trigger to fire on all operations, then we might sometimes need to find out which operation type has been used for the request. The request.getOperationType() function can be used for this purpose. Please note that the getOperationType() function returns a string, and somewhat confusingly, the possible values that it can return are slightly different to the operation types we discussed above. The function can return one of these strings: Create, Replace, Upsert, or Delete.
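
One way to handle these strings is to branch on them explicitly, so that an unexpected value fails loudly rather than being silently ignored. The following is a small sketch: the shouldValidate helper is a hypothetical name, and treating Upsert the same way as Create and Replace is an assumption we'd want to confirm for our own triggers.

```typescript
// The four strings listed above as possible return values.
type OperationType = "Create" | "Replace" | "Upsert" | "Delete";

// Hypothetical helper: decides whether a trigger's validation logic should
// run for the given operation type.
function shouldValidate(operationType: string): boolean {
  switch (operationType as OperationType) {
    case "Create":
    case "Replace":
    case "Upsert": // assumption: upserts are validated like creates/replaces
      return true;
    case "Delete":
      return false;
    default:
      throw new Error(`Unexpected operation type: ${operationType}`);
  }
}

const validateOnCreate = shouldValidate("Create");
const validateOnDelete = shouldValidate("Delete");
```

Within a trigger, we would pass in the result of request.getOperationType() rather than a literal string.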


The Collection object can be used in much the same way as in a stored procedure, so refer to part 3 of this series for more details on how this works. In summary, we can retrieve documents by their ID (readDocument()), perform queries on the documents in the collection (queryDocuments()), and pass in parameterised queries (using another overload of the queryDocuments() function). We can also insert documents (createDocument()), modify existing documents (replaceDocument()), upsert documents (upsertDocument()), and delete documents (deleteDocument()). We can even change documents that aren't directly being modified by the operation that caused this trigger.


The Response object is only available in post-triggers. If we try to use it from within a pre-trigger, an exception will be thrown by the Cosmos DB runtime. Within post-triggers, the Response represents the document after the operation was completed by Cosmos DB. This can be a little unclear, so here's an example. Let's imagine we're inserting a simple document, and not including our own ID - we want Cosmos DB to add it for us. Here's the document we're going to insert from our client:

If we intercepted this document within a pre-trigger, by using the Request object, we'd be able to inspect and modify the document as it appears above. Once the pre-trigger has finished its work, Cosmos DB will actually perform the insert, and at that time it appends a number of properties to the document automatically for its own tracking purposes. These properties are then available to us within the Response object in a post-trigger. So this means that, within a post-trigger, we can inspect the Request - which will show the original document as it appears above - and we can also inspect the Response, where the document will look something like this:

We can even use the response.setBody() function to make a change to this response object if we want to, and the change will be propagated to the document in the collection before control returns back to the client.

You might be wondering what a delete operation's behaviour is. If we examine the body through either the Request or Response during a deletion operation, we'll see that these functions return the contents of the document that is being deleted. This means we can easily perform validation logic, such as cancelling the deletion based on the contents of the document. Note that request.setBody() cannot be used within a deletion, although interestingly response.setBody() does appear to work, but doesn't actually do anything (at least that I can find!).

The combination of the Request, Response, and Collection objects gives us a lot of power to run custom logic at all points along the process. Now we know what we can do with triggers, let's talk about how we cause them to be fired.

Firing a Trigger

There is a non-obvious aspect to using triggers within Cosmos DB: at the time we perform a collection operation, we must explicitly request that the trigger be fired. If you're used to triggers in other databases like SQL Server, this might seem like a rather strange limitation. But - as with so much about Cosmos DB - it is because Cosmos DB prioritises performance above many other considerations. Including triggers within an operation will cause that operation to use more request units, so given the focus on performance, it makes some sense that if a trigger isn't needed for a transaction then it shouldn't be executed. It does make our lives as developers a little more difficult though, because we have to remember to request that triggers be included.

Another important caveat to be aware of is that at most one pre-trigger and one post-trigger can be executed for any given operation. Confusingly, the Cosmos DB .NET client library actually makes it seem like we can execute multiple triggers of a given type because it accepts a list of triggers to fire, but in reality we can only specify one. I assume that this restriction may be lifted in future, but for now if we want to have multiple pieces of logic executing as triggers, we have to combine them into one trigger.

It's also important to note that triggers can't be called from within server-side code. This means that we can't nest triggers - for example, we can't have an operation call trigger A, which in turn would insert a document and expect trigger B to be called. Similarly, if we have a stored procedure inserting a document, it can't expect a trigger to be called either.

And there's one further limitation - the Azure Portal doesn't provide a way to fire triggers within Document Explorer. This makes it a little more challenging to test triggers when you're working on them. I typically do one of three things:

  • I write a basic C# console application or similar, and have it use the Cosmos DB client library to perform the operation and fire the trigger; or

  • I use the REST API to execute a document operation directly - this is a little more challenging as we need to authenticate and structure the request manually, which can be a little complex; or

  • I use a tool named DocumentDB Studio, which is a sample application that uses the Cosmos DB client library. It provides the ability to specify triggers to be fired.

Later in this post we'll look at DocumentDB Studio. For now, let's quickly look at how to request that the Cosmos DB .NET client library fire a trigger during an operation too. In the C# application (with a reference to the Cosmos DB library already in place - i.e. one of the .NET Framework or .NET Standard/.NET Core NuGet packages) we can write something like this:

Now that we've seen when and how triggers can be fired, let's go through a few common use cases for them.


A common application for triggers is to validate documents when they are being modified or inserted. For example, we might want to validate that a particular field is present, or that the value of a field conforms to our expectations. We can write sophisticated validation logic using JavaScript or TypeScript, and then Cosmos DB can execute this when it needs to. If we find a validation failure, we can throw an error and this will cause the transaction to be rolled back by Cosmos DB. We'll work through an example of this scenario later in this post.


We might also need to make a change to a document as it's being inserted or updated. For example, we might want to add some custom metadata to each document, or to transform documents from an old schema into a new schema. The setBody() function on the Request object can be used for this purpose. For example, we can write code like the following:
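
Here is a minimal sketch of that kind of transformation. The trimmed IRequest interface, the addMetadataImpl function, and the schemaVersion property are all hypothetical, and the request is a simple fake so that the snippet can run outside Cosmos DB.

```typescript
// Trimmed, hypothetical version of the Request object.
interface IRequest {
  getBody(): any;
  setBody(body: any): void;
}

// Hypothetical pre-trigger logic: stamp each incoming document with some
// custom metadata before it is written.
function addMetadataImpl(request: IRequest, schemaVersion: number): void {
  const body = request.getBody();
  body.schemaVersion = schemaVersion; // hypothetical property name
  request.setBody(body);              // replace the document to be written
}

// Simple fake request, used only so that the sketch can run on its own.
function createFakeRequest(initialBody: any): IRequest {
  let body = initialBody;
  return {
    getBody: () => body,
    setBody: (newBody: any) => { body = newBody; }
  };
}

const request = createFakeRequest({ id: "ORD1", type: "order" });
addMetadataImpl(request, 2);
```

After the call, the body held by the request includes the extra property, and in a real pre-trigger that modified body is what would be written to the collection.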

Secondary Effects

Sometimes we might need to create or modify a different document in the collection as a result of an operation. For example, we might have to create an audit log entry for each document that is deleted, update a metadata document to include some new information as the result of an insert, or automatically write a second document when we see a certain type of document or operation occurring. We'll see an example of this later in this post.

Using Triggers for Aggregation

If we need to perform aggregation of data within a collection, it may be tempting to consider using a trigger to calculate running aggregates as data comes into, or is changed within, the collection. However, as noted in part 1 of this series, this is not a good idea in many cases. Although triggers are transactional, transactions are not serialised, and therefore race conditions can emerge. For example, if we have a document being inserted by transaction A, and a post-trigger is calculating an aggregate across the collection as a result of that insertion, it won't include the data within transaction B, which is occurring simultaneously. These scenarios are extremely difficult to test and may result in unexpected behaviour and inconsistent results. In some situations using optimistic concurrency (with the _etag property of the running aggregate document) may be able to resolve this, but it is very difficult to get this right.

Instead of calculating running aggregates within a trigger, it is usually better to calculate aggregates at query time. Cosmos DB recently added support for aggregations within its SQL API, so querying using functions such as SUM() and COUNT() will work across your data. For applications that require grouping as well as aggregation, such as calculating aggregates by group, an upcoming addition to Cosmos DB will allow for GROUP BY to be used within SQL queries. In the meantime, a stored procedure - similar to that from part 3 of this series - can be used to emulate this behaviour.

Now that we've talked through how triggers work and what they can do, let's try building some.

Defining our Triggers

In this post, we'll walk through creating two triggers - one pre-trigger and one post-trigger.

Our pre-trigger will let us do some validation and manipulation of incoming documents. In part 2 of this series, we built a UDF to allow for a change in the way that customer IDs are specified on order documents. Now we want to take the next step and ensure that any new order documents are using our new format right from the start. This means that we want to allow for this document to be inserted:

But this one should not be able to be inserted as-is:

So there are four code paths that we need to consider in our pre-trigger:

  1. The document is not an order - in this case, the trigger shouldn't do anything and should let the document be inserted successfully.

  2. The document is an order and uses the correct format for the customer ID - in this case, the trigger should let the document be inserted successfully without any modification.

  3. The document is an order and uses the old format for the customer ID - in this case, we want to modify the document so that it uses the correct format for the customer ID.

  4. The document is an order and doesn't include the customer ID at all - this is an error, and the transaction should be rolled back.

Our post-trigger will be for a different type of business rule. When an order comes through containing an item with a negative quantity, we should consider this to be a refund. In our application, refunds need a second type of document to be created automatically. For example, when we see this order:

Then we should save the order as it was provided, and also create a document like this:

These two example triggers will let us explore many of the features of triggers that we discussed above.

Preparing Folders

Let's set up some folder structures and our configuration files. If you want to see the finished versions of these triggers, you can access them on GitHub - here is the pre-trigger and here is the post-trigger.

In a slight departure from what we did in part 2 and part 3 of this series, we'll be building two triggers in this part. First, create a folder named pre for the pre-trigger. Within that, create a package.json and use the following:

In the same folder, add a tsconfig.json file with the following contents:

And create a src folder with an empty file named validateOrder.ts inside it. We'll fill this in later.

Next, we'll create more or less the same folder structure for our post-trigger, inside a folder named post. The two differences will be: (1) that the trigger code file will be named addRefund.ts; and (2) the tsconfig.json should have the outFile line replaced with the correct filename, like this:

Our folder structure should now look like this:

  • /

    • /pre

      • package.json

      • tsconfig.json

      • /src

        • validateOrder.ts

    • /post

      • package.json

      • tsconfig.json

      • /src

        • addRefund.ts

Writing the Pre-Trigger

Now let's fill in our pre-trigger. Open up the empty validateOrder.ts file and add this code:

There's quite a lot here, so let's examine it - this time from the bottom up:

  • We declare a TypeScript interface named OrderDocument, which describes the shape of our order documents. We also include a base class named BaseDocument to keep things consistent with our post-trigger, described in the next section. As in part 2 of this series, we include both possible ways of expressing the customer ID so that we can reference these two properties within our script.

  • We declare a TypeScript const enum named DocumentTypes. This is where we'll define our possible order types. For now we only have one. TypeScript will substitute any reference to the values of these enums for their actual values - so in this case, any reference to DocumentTypes.Order will be replaced with the value of the string "order".

  • Our validateOrderImpl function contains our core implementation. Before we do anything else, we check the operation type that is being performed; if it's a Delete then we skip the validation, but for Create and Replace operations we carry on. Then we pull out the document and perform the actual validation. If the customerId property is set, we transform the document so that it conforms to our expectation. If the customer ID is not provided at all then we throw an error.

  • Finally, our validateOrder method handles the interaction with the getContext() function and calls into the validateOrderImpl() function. As is the case in earlier parts of this series, we do this so that we can make this code more testable when we get to part 5 of this series.
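
The following is a rough sketch of what the implementation function described above could look like, covering the four code paths we listed earlier. It is an approximation written for this explanation, not the finished trigger from GitHub. In particular, the nested customer.id property used for the new format, the trimmed IRequest interface, the shortened error message, and the fake request are all assumptions.

```typescript
// Hypothetical shapes. `customerId` is the old format; the nested
// `customer.id` property is an ASSUMED name for the new format.
const enum DocumentTypes { Order = "order" }

interface BaseDocument { id: string; type: string; }
interface OrderDocument extends BaseDocument {
  customerId?: string;
  customer?: { id: string };
}

// Trimmed, hypothetical version of the Request object.
interface IRequest {
  getOperationType(): string;
  getBody(): any;
  setBody(body: any): void;
}

function validateOrderImpl(request: IRequest): void {
  // Only Create and Replace operations are validated.
  const operationType = request.getOperationType();
  if (operationType !== "Create" && operationType !== "Replace") { return; }

  // Not an order: let it through untouched.
  const doc = request.getBody() as BaseDocument;
  if (doc.type !== DocumentTypes.Order) { return; }

  // Already in the new format: nothing to do.
  const order = doc as OrderDocument;
  if (order.customer && order.customer.id) { return; }

  // Old format: transform the document into the new format.
  if (order.customerId) {
    order.customer = { id: order.customerId };
    delete order.customerId; // assumption: the old property is dropped
    request.setBody(order);
    return;
  }

  // No customer ID at all: throwing rolls back the whole operation.
  throw new Error("Customer ID is missing.");
}

// Simple fake request so that the sketch can run outside Cosmos DB.
function createFakeRequest(operationType: string, initialBody: any): IRequest {
  let body = initialBody;
  return {
    getOperationType: () => operationType,
    getBody: () => body,
    setBody: (newBody: any) => { body = newBody; }
  };
}

const oldFormat = createFakeRequest("Create", { id: "ORD1", type: "order", customerId: "123" });
validateOrderImpl(oldFormat);
```

Because DocumentTypes is a const enum, the comparison against DocumentTypes.Order compiles down to a comparison against the string "order", so nothing extra is emitted into the JavaScript that we send to Cosmos DB.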

That's all there is to our pre-trigger. Now let's look at our post-trigger.

Writing the Post-Trigger

Open up the empty addRefund.ts file and add this in:

Let's explain this, again working from the bottom up:

  • Firstly, we declare interfaces for our OrderDocument and our RefundDocument, both of which will inherit (extend) the BaseDocument interface.

  • We also declare our const enum for DocumentTypes, and in this case we need to track both orders and refunds, so both of these are members of this enum.

  • Then we define our addRefundImpl method. It implements the core business rules we discussed - it checks the type of document, and if it's an order, it checks to see whether there are any items with a negative quantity. If there are, it creates a refund document and upserts that into the same collection. Like the pre-trigger, this trigger also only runs its logic when the operation type is Create or Replace, but not Delete.

  • Finally, the addRefund method is our entry point method, which mostly just wraps the addRefundImpl method and makes it easier for us to test this code out later.
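
As with the pre-trigger, here is a rough sketch of the implementation function described above. It is an approximation rather than the finished trigger from GitHub: the items, productId, quantity, orderId, and refundedProductIds property names, the trimmed interfaces (including the shortened upsertDocument signature), and the fakes at the bottom are all assumptions made so that the sketch can run on its own.

```typescript
const enum DocumentTypes { Order = "order", Refund = "refund" }

// Hypothetical document shapes; the property names are assumptions.
interface BaseDocument { id: string; type: string; }
interface OrderItem { productId: string; quantity: number; }
interface OrderDocument extends BaseDocument { items: OrderItem[]; }
interface RefundDocument extends BaseDocument {
  orderId: string;
  refundedProductIds: string[];
}

// Trimmed, hypothetical versions of the server-side objects.
interface IRequest { getOperationType(): string; }
interface IResponse { getBody(): any; }
interface ICollection {
  getSelfLink(): string;
  upsertDocument(collectionLink: string, body: any): boolean;
}

function addRefundImpl(request: IRequest, response: IResponse, collection: ICollection): void {
  // Only Create and Replace operations can produce refunds.
  const operationType = request.getOperationType();
  if (operationType !== "Create" && operationType !== "Replace") { return; }

  const doc = response.getBody() as BaseDocument;
  if (doc.type !== DocumentTypes.Order) { return; }

  // Items with a negative quantity are treated as refunds.
  const order = doc as OrderDocument;
  const refundedProductIds = (order.items || [])
    .filter(item => item.quantity < 0)
    .map(item => item.productId);
  if (refundedProductIds.length === 0) { return; }

  const refund: RefundDocument = {
    id: `refund-${order.id}`,
    type: DocumentTypes.Refund,
    orderId: order.id,
    refundedProductIds
  };
  const accepted = collection.upsertDocument(collection.getSelfLink(), refund);
  if (!accepted) { throw new Error("Unable to upsert the refund document."); }
}

// Fakes so that the sketch can run outside Cosmos DB.
const upserted: any[] = [];
const fakeCollection: ICollection = {
  getSelfLink: () => "self-link",
  upsertDocument: (_link: string, body: any) => { upserted.push(body); return true; }
};
const sampleOrder = {
  id: "ORD6",
  type: "order",
  items: [{ productId: "PROD1", quantity: -1 }, { productId: "PROD2", quantity: 3 }]
};
addRefundImpl({ getOperationType: () => "Create" }, { getBody: () => sampleOrder }, fakeCollection);
```

Throwing when the upsert isn't accepted means the whole operation, including the original order, would be rolled back rather than leaving an order without its refund document.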

Compiling the Triggers

Because we have two separate triggers, we need to compile them separately. First, open a command prompt or terminal to the pre folder, and execute npm run build. Then switch to the post folder and run npm run build again. You should end up with two output files.

In the pre/output/trig-validateOrder.js file, you should have a JavaScript file that looks like the following:

And in the post/output/trig-addRefund.js file, you should have the following JavaScript:

Now that we've got compiled versions of our triggers, let's deploy them.

Deploying the Triggers

As in parts 2 and 3 of this series, we'll deploy using the Azure Portal for now. In part 6 of this series we'll look at some better ways to deploy our code to Cosmos DB.

Open the Azure Portal to your Cosmos DB account, select Script Explorer, and then click the Create Trigger button. Now let's add the contents of the trig-validateOrder.js file into the big text box, enter the name validateOrder, and make sure that the Trigger Type drop-down is set to Pre and that the Trigger Operation drop-down is set to All, like this:


Click Save, close the blade, and click Create Trigger to create our second trigger. This one will be named addRefund, and ensure that Trigger Type is set to Post for this one. Trigger Operation should still be set to All, like this:


Testing the Triggers

Testing triggers can be a little tricky, because the Azure Portal doesn't have a way to fire triggers when we work with documents through Document Explorer. This means we have to use another approach to try out our triggers.

DocumentDB Studio is an open-source tool for working with Cosmos DB data. There are several such tools available now, including Azure Storage Explorer and Visual Studio, but other than DocumentDB Studio, I haven't found any that allow for firing triggers during document operations. Unfortunately DocumentDB Studio is only available for Windows, so if you're on a Mac or Linux, you may have to look at another approach to fire the triggers, like writing a test console application using the Cosmos DB SDK.

Install the latest release of DocumentDB Studio from the GitHub releases page, and go to File and then Add Account to log in using your Cosmos DB account endpoint and key. This will authenticate you to Cosmos DB and should populate the databases and collections list in the left pane.

Expand out the account, and go to the Orders database and Orders collection. Right-click the collection name and click Create Document:


Paste the contents of the following file into the document textbox:

We can instruct DocumentDB Studio to fire triggers by clicking on the RequestOptions tab and deselecting the Use default checkbox. Then in the PreTrigger field, enter validateOrder and in the PostTrigger field, enter addRefund:


Now that we've prepared our request, click the Execute button on the toolbar, and we should see an error returned by Cosmos DB:


We can see the error we're getting is Customer ID is missing. Customer ID must be provided within the property. This is exactly what we wanted - the trigger is noticing that neither the customerId field nor the fields are provided, and it throws an error and rolls back the transaction.
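
To make this behaviour a little more concrete, here's a minimal sketch of the kind of logic a pre-trigger like validateOrder might contain. This isn't the contents of trig-validateOrder.js: the ITriggerRequest and OrderDocument interfaces are local stand-ins, the nested customer.id property is an assumed name for the new format, and createFakeRequest is a hypothetical helper that exists only so the logic can be run outside Cosmos DB.

```typescript
// Hypothetical order shape: 'customerId' is the old format, while the nested
// 'customer.id' property is an assumed name for the new format.
interface OrderDocument {
    id: string;
    customerId?: string;
    customer?: { id: string };
}

// Local stand-in for the part of the Request object that a pre-trigger uses.
interface ITriggerRequest {
    getBody(): OrderDocument;
    setBody(value: OrderDocument): void;
}

// Validates the incoming order and rewrites the old format into the new one.
function validateOrderImpl(request: ITriggerRequest): void {
    const order = request.getBody();
    if (order.customer === undefined) {
        if (order.customerId === undefined) {
            // An unhandled error cancels the operation and rolls it back.
            throw new Error("Customer ID is missing.");
        }
        order.customer = { id: order.customerId };
        delete order.customerId;
        request.setBody(order);
    }
}

// A tiny in-memory request, useful only for trying the logic out locally.
function createFakeRequest(body: OrderDocument): ITriggerRequest {
    let current = body;
    return {
        getBody: () => current,
        setBody: (value) => { current = value; },
    };
}
```

Keeping the logic in a function that receives the request explicitly means it can be exercised with a fake like this one, without needing a running Cosmos DB account.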

Now let's try a second document. This time, the customer ID is provided in the old format:

This time, you should see that the order is created, but that it's been transformed by the validateOrder trigger - the customer ID is now within the field:


Now let's try a third document. This one has a negative quantity on one of the items:

This document should be inserted correctly. To see the refund document that was also created, we need to refresh the collection - right-click the Orders collection on the left pane and click Refresh Documents feed:


You should see that a new document, refund-ORD6, has been created. The refund document lists the refunded product IDs, which in this case is just PROD1:
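
The rule behind a refund document like this can be sketched as a small pure function. This is an illustration rather than the addRefund trigger itself: the interface and property names (items, productId, quantity, refundedProductIds) are assumptions, and only the refund- ID prefix is taken from the result described here.

```typescript
// Hypothetical shapes for the order and the refund document.
interface RefundOrderItem { productId: string; quantity: number; }
interface RefundableOrder { id: string; items: RefundOrderItem[]; }
interface RefundDocument { id: string; refundedProductIds: string[]; }

// Returns a refund document when any item has a negative quantity,
// or undefined when there is nothing to refund.
function buildRefund(order: RefundableOrder): RefundDocument | undefined {
    const refundedProductIds = order.items
        .filter(item => item.quantity < 0)
        .map(item => item.productId);
    if (refundedProductIds.length === 0) {
        return undefined;
    }
    return { id: "refund-" + order.id, refundedProductIds: refundedProductIds };
}
```

In a post-trigger, a document built this way would then be handed to the collection's createDocument function so that it is saved within the same transaction as the order.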



Triggers can be used to run our own custom code within a Cosmos DB document transaction. We can do a range of actions from within triggers such as validating the document's contents, modifying the document, querying other documents within the collection, and creating or modifying other documents. Triggers run within the context of the operation's transaction, meaning that we can cause the whole operation to be rolled back by throwing an unhandled error. Although they come with some caveats and limitations, triggers are a very powerful feature and enable a lot of useful custom functionality for our applications. In the next part of this series, we will discuss how to test our server-side code - triggers, stored procedures, and functions - so that we can be confident that it is doing what we expect.

Key Takeaways

  • Triggers come in two types - pre-triggers, which run before the operation, and post-triggers, which run after the operation. Both types can intercept and change the document, and can throw an error to cancel the operation and roll back the transaction.

  • Triggers can be configured to fire on specific operation types, or to run on all operation types.

  • The getOperationType() method on the Request object can be used to identify which operation type is underway, although note that the return values of this method don't correspond to the list of operation types that we can set for the trigger.

  • Cosmos DB must be explicitly instructed to fire a trigger when we perform an operation. We can do this through the client SDK or REST API when we perform the operation.

  • Only a single trigger of each type (pre- and post-operation) can be called for a given operation.

  • The Collection object can be used to query and modify documents within the collection.

  • The Request object's getBody() method can be used to examine the original request, and in pre-triggers, setBody() can modify the request before it's processed by Cosmos DB.

  • The Response object can only be used within post-triggers. It allows both reading the document after it was inserted (including auto-generated properties like _ts) and modifying the document.

  • Triggers cannot be nested - i.e. we cannot have a trigger perform an operation on the collection and request a second trigger be fired from that operation. Similarly a stored procedure can't fire a trigger.

  • Testing triggers can be a little difficult, since the Azure Portal doesn't provide a good way to do this currently. We can test using our own custom code, the REST API, or by using DocumentDB Studio.

  • You can view the code for this post on GitHub. The pre-trigger is here, and the post-trigger is here.

Cosmos DB Server-Side Programming with TypeScript - Part 3: Stored Procedures

This post was originally published on the Kloud blog.

Stored procedures, the second type of server-side code that can run within Cosmos DB, provide the ability to execute blocks of functionality from inside the database engine. Typically we use stored procedures for discrete tasks that can be encapsulated within a single invocation. In this post, we will discuss some situations where stored procedures can be used and the actions and queries that they can perform. We'll then start to work through the server-side API model, and look at how we can work with the incoming stored procedure invocation's request and response as well as the Cosmos DB collection itself. Then we'll build a simple stored procedure using TypeScript.

This post is part of a series of posts about server-side programming for Cosmos DB:

  • Part 1 gives an overview of the server side programmability model, the reasons why you might want to consider server-side code in Cosmos DB, and some key things to watch out for.

  • Part 2 deals with user-defined functions, the simplest type of server-side programming, which allow for adding simple computation to queries.

  • Part 3 (this post) talks about stored procedures. These provide a lot of powerful features for creating, modifying, deleting, and querying across documents - including in a transactional way.

  • Part 4 introduces triggers. Triggers come in two types - pre-triggers and post-triggers - and allow for behaviour like validating and modifying documents as they are inserted or updated, and creating secondary effects as a result of changes to documents in a collection.

  • Part 5 discusses unit testing your server-side scripts. Unit testing is a key part of building a production-grade application, and even though some of your code runs inside Cosmos DB, your business logic can still be tested.

  • Finally, part 6 explains how server-side scripts can be built and deployed into a Cosmos DB collection within an automated build and release pipeline, using Microsoft Visual Studio Team Services (VSTS).

Stored Procedures

Stored procedures let us encapsulate blocks of functionality, and then later invoke them with a single call. As with the other server-side code types, the code inside the stored procedure runs inside the Cosmos DB database engine. Within a stored procedure we can perform a range of different actions, including querying documents as well as creating, updating, and deleting documents. These actions are done within the collection that the stored procedure is installed in. Of course, we can augment these collection-based actions with custom logic that we write ourselves in JavaScript or - as is the case in this series - TypeScript.

Stored procedures are extremely versatile, and can be used for a number of different tasks including:

  • Encapsulating a complex set of queries and executing them as one logical operation - we will work with this example below.

  • Retrieving data for a report using a complex set of queries, and combining the results into a single response that can be bound to a report UI.

  • Generating mock data and inserting it into the collection.

  • Doing a batch insert, update, upsert, or delete of multiple documents, taking advantage of the transactional processing of stored procedures.

Of course, if you are building a simple application without the need for complex queries, you may be able to achieve everything you need with just the Cosmos DB client-side SDKs. However, stored procedures do give us some power and flexibility that is not possible with purely client-based development, including transaction semantics.

Transactions and Error Handling

Cosmos DB's client programming model does not provide for transactional consistency. However, stored procedures do run within an implicit transaction. This means that if you modify multiple documents within a stored procedure, then either all of those changes will be saved or - in the event of an error - none of them will be saved. Transactions provide four guarantees (atomicity, consistency, isolation, and durability, also known as ACID). More information on Cosmos DB transactions is available here.

The Cosmos DB engine handles committing and rolling back transactions automatically. If a stored procedure completes without any errors being thrown then the transaction will be committed. However, if even one unhandled error is thrown, the transaction will be rolled back and no changes will be made.
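
As a rough illustration of this all-or-nothing behaviour, the sketch below runs a stored-procedure-style function against a local simulation of a transaction. Everything here is an assumption made for the example: IWritableCollection is a cut-down stand-in for the real Collection type, the zero-quantity rule is hypothetical, and runInFakeTransaction only imitates commit and rollback by copying an array. Inside Cosmos DB, the engine does this work for us.

```typescript
// Cut-down local stand-in for the collection functions used in this sketch.
interface IWritableCollection {
    getSelfLink(): string;
    createDocument(collectionLink: string, body: object): boolean;
}

interface NewOrder { id: string; quantity: number; }

// Inserts every order, or throws so that nothing at all is saved.
function insertOrdersImpl(orders: NewOrder[], collection: IWritableCollection): void {
    for (const order of orders) {
        if (order.quantity === 0) {
            // Hypothetical business rule, used only to trigger a failure.
            throw new Error("Order " + order.id + " has a zero quantity.");
        }
        const accepted = collection.createDocument(collection.getSelfLink(), order);
        if (!accepted) {
            throw new Error("The insert was not accepted.");
        }
    }
}

// Local simulation only: works on a copy of the store, and keeps the copy
// only when the body finishes without an error escaping.
function runInFakeTransaction(
    store: object[],
    body: (collection: IWritableCollection) => void
): object[] {
    const working = store.slice();
    const collection: IWritableCollection = {
        getSelfLink: () => "dbs/fake/colls/fake",
        createDocument: (_link, doc) => { working.push(doc); return true; },
    };
    try {
        body(collection);
        return working; // "commit"
    } catch (e) {
        return store;   // "rollback"
    }
}
```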

Working with the Context

Unlike user-defined functions, which we discussed in part 2 of this series, stored procedures allow us to access and make changes to the collection that they run within. We can also return results from stored procedures. Both of these types of actions require that we work with the context object.

Within a stored procedure, we can make a call to the getContext() function. This returns an object with three functions:

  • getContext().getRequest() is used to access the request details. This is mostly helpful for triggers, and we will discuss this in part 4 of this series.

  • getContext().getResponse() lets us set the response that we should send back. For stored procedures, this is the way that we will return data back to the client if the stored procedure has something to return.

  • getContext().getCollection() gives us access to the Cosmos DB collection that the stored procedure runs within. In turn, this will let us read and write documents.

Each of the calls above corresponds to a type - Context, Request, Response, and Collection, respectively. Each of these types, in turn, provides a set of functions for interacting with the object. For example, getContext().getCollection().queryDocuments() lets us run a query against documents in the collection, and getContext().getResponse().setBody() lets us specify the output that we want the stored procedure to return. We'll see more of these functions as we go through this series.

Also note that the double-underscore (__) is automatically mapped to the getContext().getCollection() function. In this series we won't use this shortcut because I want to be more explicit, especially to help with testing when we get to part 5.

Type Definitions

Human-readable documentation for the types and their members is provided by Microsoft here. Of course, one of the main reasons we're using TypeScript in this series is so that we get type checking and strong typing against the Cosmos DB object model, so a human-readable description isn't really sufficient. In TypeScript, this is done through type definitions - descriptions of types and their members that TypeScript can use to power its type system.

While Microsoft doesn't provide first-party TypeScript type definitions for Cosmos DB, an open-source project named DefinitelyTyped provides and publishes these definitions here. (Note that the type definitions use the old name for Cosmos DB - DocumentDB - but they are still valid and being updated.)


One of the main things we'll frequently do from inside stored procedures is execute queries against Cosmos DB. This is how we can retrieve data and perform custom logic on it within the stored procedure. Cosmos DB provides an integrated JavaScript query syntax for executing queries. The syntax is documented here and lets us write queries like this:

which will map to the following SQL query:

However, there are some limitations to this query syntax. We can't perform aggregations, and we can't do queries using user-defined functions. These limitations may be lifted in future, but for now, in this series we will use the SQL syntax instead so that we can get the full power of Cosmos DB's query engine. We can use this type of syntax to make a SQL-based query:

In your own stored procedures and triggers you can decide which approach - integrated JavaScript or SQL - makes more sense.

We can also request a document by ID if we want to:

Another important consideration when executing queries is that Cosmos DB limits the amount of time that a stored procedure can run for. We can test whether our stored procedure is approaching that time limit by inspecting the return value we receive when we submit the query. If we receive a false response, it means the query wasn't accepted - and that it's probably time to wrap up our logic before we get forcibly shut off. In some stored procedures, receiving a false like this may mean that you simply throw an error and consider the whole stored procedure execution to have failed.
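
Here's a small sketch of checking that return value. The IQueryableCollection interface and FeedCallback type are simplified local stand-ins (the real callback receives a richer error object and an extra options argument), and countOrdersImpl is a hypothetical function used only to show the pattern.

```typescript
// Simplified stand-ins for the query-related parts of the Collection type.
type FeedCallback = (error: Error | undefined, resources: object[]) => void;

interface IQueryableCollection {
    getSelfLink(): string;
    queryDocuments(collectionLink: string, query: string, callback: FeedCallback): boolean;
}

// Counts the order documents, and fails fast if the query isn't accepted.
function countOrdersImpl(
    collection: IQueryableCollection,
    onDone: (count: number) => void
): void {
    const accepted = collection.queryDocuments(
        collection.getSelfLink(),
        "SELECT * FROM c WHERE c.type = 'order'",
        (error, resources) => {
            if (error) {
                throw error;
            }
            onDone(resources.length);
        });

    if (!accepted) {
        // We're probably close to the time limit, so give up rather than
        // being forcibly shut off part-way through.
        throw new Error("The query was not accepted by Cosmos DB.");
    }
}
```

Throwing here is only one option. A longer-running stored procedure might instead return the work it has completed so far, so that the client can call it again to continue.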

Parameterised Queries

In the last section, we discussed using the collection.queryDocuments function to execute a SQL query against the collection. While this technically works, once we start including parameters in our queries then we'd need to concatenate them into the query string. This is a very, very bad idea - it opens us up to a class of security vulnerabilities called SQL injection attacks.

When we're writing SQL queries with parameters, we should instead use the overload of the collection.queryDocuments function that accepts an IParameterizedQuery rather than a string. By doing this, we pass our parameters explicitly and ensure that they are handled and escaped appropriately by the database engine. Here's an example of how we can do this from our TypeScript code:
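
As a minimal sketch, a parameterised query object might be built like this. The ISketchParameterizedQuery interface is declared locally so that the example stands alone; it's modelled on the IParameterizedQuery shape (a query string plus a list of named parameters). The function name and the customerId property in the query text are assumptions for illustration.

```typescript
// Local stand-in modelled on the IParameterizedQuery shape.
interface ISketchParameterizedQuery {
    query: string;
    parameters: { name: string; value: unknown }[];
}

// Builds a query for one customer's orders. The customer ID travels as a
// named parameter and is never concatenated into the query text.
function buildOrdersForCustomerQuery(customerId: string): ISketchParameterizedQuery {
    return {
        query: "SELECT * FROM c WHERE c.type = 'order' AND c.customerId = @customerId",
        parameters: [{ name: "@customerId", value: customerId }],
    };
}
```

The resulting object is what we'd pass as the second argument to collection.queryDocuments, in place of a plain query string.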

Updating Documents

We can also make changes to documents within the collection. There are several functions on the Collection type to help with this, including:

  • createDocument inserts a new document into the collection.

  • replaceDocument updates an existing document in the collection. You must provide the document link to use this function.

  • deleteDocument deletes a document from the collection.

  • upsertDocument inserts a document if it doesn't already exist, or replaces it if it does.

There are also functions to deal with attachments to documents, but we won't work with those in this series.

The functions that work with existing documents also take an optional parameter to specify the etag of the document. This allows us to take advantage of optimistic concurrency. Optimistic concurrency is very useful, but is outside the scope of this series.

Structuring Your Stored Procedure Code

Stored procedures will often become more complex than UDFs, and may incorporate business logic as well as interaction with the Cosmos DB collection inside the code. When we're writing a production-grade application it is important that we structure our code correctly so that each part is testable, and has a clearly defined responsibility and interactions with other components. In Cosmos DB, we are interacting with the collection in order to execute queries and update documents, and these side effects are important to test as well. We'll discuss testing in more detail in part 5, but even before we worry about testing, it's a good idea to structure our code properly.

When I write Cosmos DB stored procedures, I find it's helpful to have a simple 'entry point' function. This entry point does the interaction with the getContext() function and retrieves the Collection, Request, and Response objects as required. These, along with any other parameters, are then passed into an internal implementation function, which in turn may invoke other functions to do other parts of the logic. By structuring the functions in this way we can ensure that each function has a clear purpose, and that the external components can be mocked and/or spied upon during our tests.
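
The shape of this pattern can be sketched as follows. The stored procedure here (describeCollection) is a made-up example, the three interfaces are cut-down local stand-ins for the real types, and getContextSketch is a hypothetical replaceable variable that takes the place of the getContext() function that Cosmos DB supplies at runtime.

```typescript
// Cut-down local stand-ins for the server-side types used in this sketch.
interface ISketchCollection { getSelfLink(): string; }
interface ISketchResponse { setBody(value: unknown): void; }
interface ISketchContext {
    getCollection(): ISketchCollection;
    getResponse(): ISketchResponse;
}

// Inside Cosmos DB, getContext() is provided by the engine. Here it is a
// replaceable variable so that the sketch can run outside the database.
let getContextSketch: () => ISketchContext = () => {
    throw new Error("No context has been installed.");
};

// Entry point: the only function that touches the context.
function describeCollection(prefix: string): void {
    const context = getContextSketch();
    const output = describeCollectionImpl(prefix, context.getCollection());
    context.getResponse().setBody(output);
}

// Implementation: works only with the objects that are passed in,
// so it can be called directly with a fake collection.
function describeCollectionImpl(prefix: string, collection: ISketchCollection): string {
    return prefix + collection.getSelfLink();
}
```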

Writing our stored procedure in TypeScript also gives us the ability to store our functions in different .ts files if we want. This is helpful when we have long and complicated stored procedures, or if we want to keep interfaces and functions in separate files. This is largely a choice of style, since TypeScript's compiler will simply combine all of the functions together at compilation time and will emit a single JavaScript output file (because we have set the outFile property in our tsconfig.json file). One important note on this though - if you have functions spread across multiple files, it is important to pay attention to the order in which the functions get emitted. The stored procedure's 'entry point' function must appear first in the output JavaScript file. TypeScript can be instructed to do this by explicitly listing the entry point function's file first in the include directive within the tsconfig.json file, and then having a wildcard * to catch the remaining files, like this:

Calling Stored Procedures

Once a stored procedure is written, we need to call it in order to check that it's working, and then to use it in our real applications. There are several ways we can call our stored procedure and pass in the arguments it expects.

  • The Azure Portal provides a test interface for invoking stored procedures. This is how we will test the stored procedure we write below.

  • The client SDKs provide platform-specific features for invoking stored procedures. For example, the .NET client library for Cosmos DB provides the DocumentClient.ExecuteStoredProcedureAsync function, which accepts the ID of a stored procedure and any arguments that it might be expecting.

  • The Cosmos DB REST API also allows for invoking stored procedures directly.

Again, the exact way you call stored procedures may depend on the Cosmos DB API you are targeting - in this series we are using the SQL API, and the invocation mechanisms for MongoDB, Gremlin, and the other APIs may be different.

Now that we have talked about the different aspects of writing stored procedures, we can think about how we might use a stored procedure in our sample scenario.

Defining our Stored Procedure

In our hypothetical ordering application, we have a Cosmos DB collection of documents representing orders. Each order document contains a customer's ID (as we saw in part 2 of this series), and also contains a set of items that the customer ordered. An order item consists of the product ID the customer ordered and the quantity of that product.

Because Cosmos DB is a schemaless database, our order documents may coexist with many other documents of different types. Therefore, we also include a type parameter on the document to indicate that it is an order. This type discriminator pattern is quite common in schemaless databases.

An example of an order document in our collection is:

For this stored procedure we want to pass in a set of product IDs, and get back a grouped list of IDs for customers who have ordered any of those products. This is similar to doing a GROUP BY in a relational database - but currently Cosmos DB doesn't provide this sort of grouping feature, and so we are using a stored procedure to fill in the gap. Doing this from the client SDK would require multiple queries against the collection, but by using a stored procedure we can just make one call.

At a high level, our stored procedure logic looks like this:

1. Accept a list of product IDs as an argument.

2. Iterate through the product IDs.

3. For each product ID, run a query to retrieve the customer IDs for the customers that have ordered that product, filtering to only query on order documents. The query we'll run looks like this:

4. Once we have all of the customer IDs for each product in our list, create a JSON object to represent the results like this:

Preparing a Folder

Now we can start writing our stored procedure. If you want to compare against my completed stored procedure, you can access it on GitHub.

In part 2 of this series we covered how to set up the Cosmos DB account and collection, so I won't go through that again. We also will reuse the same folder structure as we did in part 2, so you can refer to that post if you're following along.

There's one major difference this time though. In our package.json file, we need to add a second entry into the devDependencies list to tell NPM that we want to include the TypeScript type definitions for Cosmos DB. Our package.json file will look like this:

Open a command prompt or terminal window, and run npm install within this folder. This will initialise TypeScript and the type definitions.

We'll also adjust the tsconfig.json file to emit a file with the name sp-getGroupedOrders.js:

Writing the Stored Procedure

Now let's create a file named src/getGroupedOrders.ts. This is going to contain our stored procedure code. Let's start with adding a basic function that will act as our entry point:

As discussed above, this is a pattern I try to follow when I write Cosmos DB server-side code. It helps to keep the interaction with the getContext() function isolated to this one place, and then all subsequent functions will work with explicit objects that we'll pass around. This will help when we come to test this code later. You can see that this function calls the getGroupedOrdersImpl function, which does the actual work we need done - we'll write this shortly.

Before then, though, let's write a basic interface that will represent our response objects:

Our getGroupedOrdersImpl function will accept an array of product IDs and the collection in which to query them, and it will return an array of CustomersGroupedByProduct objects. Of course, since CustomersGroupedByProduct is a TypeScript interface, we can use it within our functions for type safety, but it will be stripped out when we compile the code into JavaScript.

Now we can add a basic shell of an implementation for our getGroupedOrdersImpl function. As you type this, notice that (if you're in an editor that supports it, like Visual Studio Code) you get IntelliSense and statement completion thanks to the TypeScript definitions:

This function prepares a variable called outputArray, which will contain all of our product/customer groupings. Then we have some placeholder code to perform our actual queries, which we'll fill in shortly. Finally, this function returns the output array.

Now we can fill in the placeholder code. Where we have REPLACEME in the function, replace it with this:

There's a lot going on here, so let's break it down:

  • The first part (lines 1-6) sets up a new IParameterizedQuery, which lets us execute a SQL query using parameters. As discussed above, this is a much more secure way to handle parameters than string concatenation. The query will find all orders containing the product ID we're looking for, and will return back the customer ID.

  • Next, the query callback function is prepared (lines 7-18). This is what will be called when the query results are available. In this function we pull out the results and push them onto our outputArray, ready to return to the calling function.

  • Then we try to execute the query against the collection by using the collection.queryDocuments() function (line 19). This function returns a boolean to indicate whether the query was accepted (line 20). If it wasn't, we consider this to be an error and immediately throw an error ourselves (line 22).
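
Pulling the steps described above together, a sketch of the implementation function might look like the following. It's a reconstruction under stated assumptions rather than the original listing: the IOrders* types are simplified local stand-ins for the real type definitions, the property names on CustomersGroupedByProduct and the query text are guesses based on the descriptions in this post, and createFakeCollection is a hypothetical in-memory helper that calls back synchronously.

```typescript
// Simplified local stand-ins for the real type definitions.
interface IOrdersQuery {
    query: string;
    parameters: { name: string; value: string }[];
}
type OrdersFeedCallback = (error: Error | undefined, resources: string[]) => void;
interface IOrdersCollection {
    getSelfLink(): string;
    queryDocuments(link: string, query: IOrdersQuery, callback: OrdersFeedCallback): boolean;
}

// Assumed property names for each product/customer grouping.
interface CustomersGroupedByProduct {
    productId: string;
    customerIds: string[];
}

function getGroupedOrdersImpl(
    productIds: string[],
    collection: IOrdersCollection
): CustomersGroupedByProduct[] {
    const outputArray: CustomersGroupedByProduct[] = [];
    for (const productId of productIds) {
        // Hypothetical query text; the document property names are assumed.
        const query: IOrdersQuery = {
            query: "SELECT VALUE o.customerId FROM o JOIN i IN o.items " +
                "WHERE o.type = 'order' AND i.productId = @productId",
            parameters: [{ name: "@productId", value: productId }],
        };
        // Called when the results for this product are available.
        const queryCallback: OrdersFeedCallback = (error, customerIds) => {
            if (error) {
                throw error;
            }
            outputArray.push({ productId: productId, customerIds: customerIds });
        };
        const accepted = collection.queryDocuments(
            collection.getSelfLink(), query, queryCallback);
        if (!accepted) {
            throw new Error("Unable to query for product " + productId + ".");
        }
    }
    return outputArray;
}

// In-memory fake: answers each query from a lookup table keyed by product ID.
function createFakeCollection(
    customersByProduct: { [productId: string]: string[] }
): IOrdersCollection {
    return {
        getSelfLink: () => "dbs/fake/colls/orders",
        queryDocuments: (_link, query, callback) => {
            for (const parameter of query.parameters) {
                callback(undefined, customersByProduct[parameter.value] || []);
            }
            return true;
        },
    };
}
```

Because the implementation function takes the collection as an argument, it can be run against the fake without any connection to Cosmos DB.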

That's it! The full stored procedure file looks like this:

Here's what your folder structure should now look like:

  • /

    • package.json

    • tsconfig.json

    • src/

      • getGroupedOrders.ts

Compiling the Stored Procedure

As in part 2, we can now compile our function to JavaScript. Open up a command prompt or terminal, and enter npm run build. You should see that a new output folder has been created, and inside that is a file named sp-getGroupedOrders.js. If you open this, you'll see the JavaScript version of our function, ready to submit to Cosmos DB. This has all of the type information removed, but the core logic remains the same. Here's what it should look like:

Deploying the Stored Procedure

Now let's deploy the stored procedure to our Cosmos DB collection. Open the Azure Portal, browse to the Cosmos DB account, and then go to Script Explorer. Click Create Stored Procedure.


Enter getGroupedOrders as the ID, and then paste the contents of the compiled sp-getGroupedOrders.js JavaScript file into the body.

Click Save to install the stored procedure to Cosmos DB. (Once again, this isn't the best way to install a stored procedure - we'll look at better ways in part 6 of this series.)

Testing the Stored Procedure

Now let's insert some sample data so that we can try the stored procedure out. Let's insert these sample documents using Document Explorer, as described in part 2.

Here are the three sample documents we'll use:

Now go back into Script Explorer, open the stored procedure, and notice that there is a test panel below the script body textbox. We can enter our stored procedure's parameters into the Inputs field. Let's do that now. Enter [["P1", "P2", "P10"]] - be careful to include the double square brackets around the array. Then click the Save & Execute button, and you should see the results.

If we reformat them, our results look like the following. We can see that we have an array containing an object for each product ID we passed into the query, and each object has a list of customer IDs who ordered that product:

So our stored procedure works! We've now successfully encapsulated the logic involved in querying for customers that have ordered any of a set of products.


Stored procedures give us a way to encapsulate queries and operations to be performed on a Cosmos DB collection, and to invoke them as a single unit. Stored procedures run within an implicit transaction, so any unhandled errors will result in the changes being rolled back. Unlike in UDFs, we are also able to access the collection within a stored procedure by using the getContext() function, and by retrieving the Response and Collection objects. This allows us to return rich data, including objects and arrays, as well as to interact with the collection and its documents. In the next part of this series we will discuss triggers, the third type of server-side programming available in Cosmos DB, which allow us to intercept changes happening to documents within the collection.

Key Takeaways

  • Stored procedures encapsulate logic, queries, and actions upon documents within the collection.

  • Stored procedures provide transactional isolation, and all stored procedures run within a transaction scope.

  • The getContext() function allows us to access the Response and Collection objects.

  • TypeScript definitions are available to help when writing code against these objects.

  • Cosmos DB provides an integrated query syntax, which is great for simple queries, but doesn't cover all possible queries that can be executed against the collection.

  • Arbitrary SQL queries can also be executed. If these contain parameters then the IParameterizedQuery interface should be used to ensure safe coding practices are adhered to.

  • The order of functions inside the stored procedure's file matters. The first function will be the one that Cosmos DB treats as the entry point.

  • You can view the code for this post on GitHub.

Cosmos DB Server-Side Programming with TypeScript - Part 2: User-Defined Functions

This post was originally published on the Kloud blog.

User-defined functions (UDFs) in Cosmos DB allow for simple calculations and computations to be performed on values, entities, and documents. In this post I will introduce UDFs, and then provide detailed steps to set up a basic UDF written in TypeScript. Many of these same steps will be applicable to stored procedures and triggers, which we'll look at in future posts.

This is the second part of a series of blog posts on server-side development using Cosmos DB with TypeScript.

  • Part 1 gives an overview of the server side programmability model, the reasons why you might want to consider server-side code in Cosmos DB, and some key things to watch out for.

  • Part 2 (this post) deals with user-defined functions, the simplest type of server-side programming, which allow for adding simple computation to queries.

  • Part 3 talks about stored procedures. These provide a lot of powerful features for creating, modifying, deleting, and querying across documents - including in a transactional way.

  • Part 4 introduces triggers. Triggers come in two types - pre-triggers and post-triggers - and allow for behaviour like validating and modifying documents as they are inserted or updated, and creating secondary effects as a result of changes to documents in a collection.

  • Part 5 discusses unit testing your server-side scripts. Unit testing is a key part of building a production-grade application, and even though some of your code runs inside Cosmos DB, your business logic can still be tested.

  • Finally, part 6 explains how server-side scripts can be built and deployed into a Cosmos DB collection within an automated build and release pipeline, using Microsoft Visual Studio Team Services (VSTS).

User-Defined Functions

UDFs are the simplest type of server-side development available for Cosmos DB. UDFs generally accept one or more parameters and return a value. They cannot access Cosmos DB's internal resources, and cannot read or write documents from the collection, so they are really only intended for simple types of computation. They can be used within queries, including in the SELECT and WHERE clauses.

UDFs are simple enough that types are almost not necessary, but for consistency we will use TypeScript for these too. This will also allow us to work through the setup of a TypeScript project, which we'll reuse for the next parts of this series.

One note on terminology: the word function can get somewhat overloaded here, since it can refer to the Cosmos DB concept of a UDF, or to the TypeScript and JavaScript concept of a function. This can get confusing, especially since a UDF can contain multiple JavaScript functions within its definition. For consistency I will use UDF when I'm referring to the Cosmos DB concept, and function when referring to the JavaScript or TypeScript concept.


UDFs can accept zero or more parameters. Realistically, though, most UDFs will accept at least one parameter, since UDFs almost always operate on a piece of data of some kind. The UDF parameters can be of any type, and since we are running within Cosmos DB, they will likely be either a primitive type (e.g. a single string, number, array, etc), a complex type (e.g. a custom JavaScript object, itself comprised of primitive types and other complex types), or even an entire document (which is really just a complex type). This gives us a lot of flexibility. We can have UDFs that do all sorts of things, including:

  • Accept a single string as a parameter. Do some string parsing on it, then return the parsed value.

  • Accept a single string as well as another parameter. Based on the value of the second parameter, change the parsing behaviour, then return the parsed value.

  • Accept an array of values as a parameter. Combine the values using some custom logic that you define, then return the combined value.

  • Accept no parameters. Return a piece of custom data based on the current date and time.

  • Accept a complex type as a parameter. Do some parsing of the document and then return a single output.

Invoking a UDF

A UDF can be invoked from within the SELECT and WHERE clauses of a SQL query. To invoke a UDF, you need to include the prefix udf. before the function name, like this:

SELECT udf.parseString(c.stringField) FROM c

In this example, udf. is a prefix indicating that we want to call a UDF, and parseString is the name of the UDF we want to call. Note that this is the identifier that Cosmos DB uses for the UDF, and is not necessarily the name of the JavaScript function that implements the UDF. (However, I strongly recommend that you keep these consistent, and will do so throughout this series.)

You can pass in multiple parameters to the UDF by comma delimiting them, like this:

SELECT udf.parseString(c.stringField, 1234) FROM c
SELECT udf.parseString(c.stringField1, c.stringField2) FROM c

To pass a hard-coded array into a UDF, you can simply use square brackets, like this:

SELECT udf.parseArray(["arrayValue1", "arrayValue2", "arrayValue3"])
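
The queries above call UDFs named parseString and parseArray without showing what they do. Purely as an illustration, here's what the functions behind them could look like; the behaviour (trimming and upper-casing a string, joining an array) and the optional maxLength parameter are invented for this sketch.

```typescript
// Hypothetical UDF body: tidies a string, optionally truncating the result.
function parseString(input: string, maxLength?: number): string {
    const cleaned = input.trim().toUpperCase();
    return maxLength === undefined ? cleaned : cleaned.substring(0, maxLength);
}

// Hypothetical UDF body: combines an array of values into a single string.
function parseArray(values: string[]): string {
    return values.join("|");
}
```

Each of these would be registered in Cosmos DB as its own UDF. Neither touches the collection, which is what keeps UDFs so simple.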

Now that we've talked through some of the key things to know about UDFs, let's try writing one, using our sample scenario from part 1.

Defining our UDF

Let's imagine that our order system was built several years ago, and our business has now evolved significantly. As a result, we are in the middle of changing our order schema to represent customer IDs in different ways. Cosmos DB makes this easy by not enforcing a schema, so we can simply switch to the new schema when we're ready.

Our old way of representing a customer ID was like this:

Now, though, we are representing customers with additional metadata, like this:

However, we still want to be able to easily use a customer's ID within our queries. We need a way to dynamically figure out the customer's ID for an order, and this needs to work across our old and new schemas. This is a great candidate for a UDF. Let's deploy a Cosmos DB account and set up this UDF.
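As a rough illustration of the two schemas described above, the order documents might be shaped something like the following. The field names (customerId, customer.id, customer.name) and the sample values are assumptions made for this sketch, not the actual schema.

```typescript
// Assumed field names, for illustration only.

// Old schema: the customer ID is a plain string on the order.
interface OldOrder {
    id: string;
    customerId: string;
}

// New schema: the customer is an object that carries its ID plus metadata.
interface NewOrder {
    id: string;
    customer: {
        id: string;
        name: string;
    };
}

// Hypothetical sample orders, one per schema.
const oldOrder: OldOrder = { id: "ORD1", customerId: "123" };
const newOrder: NewOrder = { id: "ORD2", customer: { id: "456", name: "Example Customer" } };
```

The customer ID lives in a different place in each shape, which is why a query can't simply refer to one field.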

Setting Up a Cosmos DB Account and Collection

First, we'll deploy a Cosmos DB account and set up a database and collection using the Azure Portal. (Later in this series, we will discuss how this can be made more automatable.) Log into the Azure Portal and click New, then choose Cosmos DB. Click Create.


We need to specify a globally unique name for our Cosmos DB account - I have used johnorders, but you can use whatever you want. Make sure to select the SQL option in the API drop-down list. You can specify any subscription, resource group, and location that you want. Click Create, and Cosmos DB will provision the account - this takes around 5-10 minutes.

Once the account is created, open it in the Portal. Click Add Collection to add a new collection.


Let's name the collection Orders, and it can go into a database also named Orders. Provision it with a fixed (10GB) capacity, and 400 RU/s throughput (the minimum).


Note that this collection will cost money to run, so you'll want to remember to delete the collection when you're finished. You can leave an empty Cosmos DB account for no charge, though.

Preparing a Folder

TypeScript requires that we provide it with some instructions on how to compile our code. We'll also use Node Package Manager (NPM) to tie all of our steps together, and so need to prepare a few things before we can write our UDF's code.

Note that in parts 2, 3, and 4 of this series, we will write each server-side component as if it were its own independent application. This is to keep each of these posts easy to follow in a standalone way. However, in parts 5 and 6, we will combine these into a single folder structure. This will more accurately represent a real-world application, in which you are likely to have more than one server-side component.

Create a new folder on your local machine and add a file named package.json into it. This contains the NPM configuration we need. The contents of the file should be as follows:

The package.json file does the following:

  • It defines the package dependencies we have when we develop our code. Currently we only have one - typescript.

  • It also defines a script that we can execute from within NPM. Currently we only have one - build, which will build our UDF into JavaScript using TypeScript's tsc command.

At a command prompt, open the folder we've been working in. Run npm install, which will find and install the TypeScript compiler. If you don't have NPM installed, install it by following the instructions here.

Next, create a file named tsconfig.json. This is the TypeScript project configuration. This should contain the following:

The tsconfig.json instructs TypeScript to do the following:

  • Find and compile all of the files with the .ts extension inside the src folder (we haven't created this yet!).

  • Target ECMAScript 2015, which is the version of JavaScript that Cosmos DB supports. If we use more modern features of TypeScript, it will handle the details of how to emit these as ECMAScript 2015-compatible code.

  • Save the output JavaScript to a single file named output/udf-getCustomerId.js. This filename is arbitrary and it could be any name we want, but I find it helpful to use a convention that prefixes the file name with the kind of component (here, udf-) and ends with .js, especially as we add more code in later parts of this series.
    Note that the outFile directive means that, even if we included multiple source TypeScript files, TypeScript will save the complete compiled script into a single output file. This is in keeping with the requirement that Cosmos DB imposes that a server-side component has to be specified in a single file.

Writing the UDF

Now we can write our actual UDF code! If you want to compare against my completed UDF, you can access it on GitHub.

Create a folder named src, and within that, create a file named getCustomerId.ts. (Once again, this file doesn't need to be named this way, but I find it helpful to use the UDF's name for the filename.) Here's the contents of this file:

Briefly, here's an explanation of what this file does:

  • It declares a function named getCustomerId, which accepts a single parameter of type OrderDocument and returns a string. This is the function that represents our UDF.

  • The function inspects the document provided, and depending on which version of the schema it follows, it pulls the customer ID out of the appropriate field.

  • If the customer ID isn't in either of the places it expects to find it, it throws an error. Cosmos DB will pass this error back to the client.

  • Finally, it declares an interface named OrderDocument. This represents the shape of the data we're expecting to store in our collection, and it has both of the ways of representing customer IDs.
    Note that we are using an interface and not a class, because this data type has no meaning to Cosmos DB - it's only for use at development and build time.
    Also note that we could put this into its own orderDocument.ts file if we wanted to keep things separated out.
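For reference, a function matching the description above could be sketched roughly as follows. The property names customerId and customer.id are assumptions about the two schemas; the error text follows the message that Cosmos DB reports in the testing section of this post.

```typescript
// Sketch of src/getCustomerId.ts. The field names customerId and customer.id
// are assumptions made for illustration.
function getCustomerId(doc: OrderDocument): string {
    if (doc.customer !== undefined) {
        // New schema: the ID lives inside the customer object.
        return doc.customer.id;
    }
    if (doc.customerId !== undefined) {
        // Old schema: the ID is a plain field on the order.
        return doc.customerId;
    }
    // Neither schema matched, so surface an error to the caller.
    throw new Error(`Document with id ${doc.id} does not contain customer ID in recognised format.`);
}

// Shape of the documents we expect, covering both ways of representing the
// customer ID. It only exists at development and build time.
interface OrderDocument {
    id: string;
    customerId?: string;
    customer?: { id: string };
}
```

Both customer fields are optional in the interface because any given order is expected to follow only one of the two schemas.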

At the end of this, your folder should look something like this:

  • /

    • package.json

    • tsconfig.json

    • src/

      • getCustomerId.ts

You can access a copy of this as a GitHub repository here.

We have now written our first UDF! We're almost ready to run it - but before then, we need to compile it.

Compiling the UDF

At the command line, run npm run build. This will run the build script we defined inside the package.json file, which in turn simply runs the tsc (TypeScript compiler) command-line application. tsc will find the tsconfig.json file and know what to do with it.

Once it's finished, you should see a new output folder containing a file named udf-getCustomerId.js. This is our fully compiled UDF! It should look like the following:

If you compare this to the code we wrote in TypeScript, you'll see that it is almost the same - except all of the type information (variable types and the interface) has been stripped away. This means that we get the type safety benefits of TypeScript at authoring and compilation time, but the file we provide to Cosmos DB is just a regular JavaScript file.

Deploying the UDF

Now we can deploy the UDF. Back in the Azure Portal, open Script Explorer under the Collections section, and then click Create User Defined Function.


Enter getCustomerId in the ID field. This will be the name we address the UDF by when we start to call it from our queries. Note that you don't have to use the same ID here as the JavaScript function name - in fact, the JavaScript function can be named anything you like. For clarity, though, I find it helpful to keep everything in sync.

Now we can copy and paste the contents of the udf-getCustomerId.js file into the large script text box.


Click Save to install the UDF to Cosmos DB.

Testing the UDF

Finally, let's test the UDF! We'll need to add a couple of pieces of sample data. Click Document Explorer under the Collections section, and then click the Create button. Paste in this sample document:

Click Save, and then close the blade and create a second document with the following contents:

This gives us enough to test with. Now click Query Explorer under the Collections section. Enter the following query:

SELECT c.id, udf.getCustomerId(c) AS customerId FROM c

This query does the following:

  • Refers to each document within the current collection (c).

  • Runs the getCustomerId UDF, passing in the document contents. Note that to refer to a UDF, you must prefix the name of the UDF with udf..

  • Projects out the document ID (id) and the UDF's result as customerId.

You should see the following output:

That's exactly what we wanted to see - the UDF has pulled out the correct field for each document.

As a point of interest, notice the Request Charge on the query results. Try running the query a few times, and you should see that it fluctuates a little - but is generally around 3.5 RUs.

Now let's try passing in an invalid input into our UDF. Run this query:

SELECT udf.getCustomerId('123') FROM c

Cosmos DB will give you back the error that our UDF threw because the input data (123) didn't match either of the schemas it expected:

Encountered exception while executing Javascript.
  Exception = Error: Document with id undefined does not contain customer ID in recognised format.
  Stack trace: Error: Document with id undefined does not contain customer ID in recognised format.
    at getCustomerId (getCustomerId.js:11:5)
    at __docDbMain (getCustomerId.js:15:5)
    at Global code (getCustomerId.js:1:2)

So we've now tested out the old customer ID format, the new customer ID format, and some invalid input, and the UDF behaves as we expect.
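If you'd like to reproduce these three cases without going through the Portal, here is a rough local sketch. It assumes the same hypothetical field names (customerId and customer.id) and made-up sample documents, and the project function only imitates what the query does; it is not how Cosmos DB executes it.

```typescript
// Assumed document shape and UDF logic, repeated here so the sketch runs standalone.
interface OrderDocument {
    id: string;
    customerId?: string;
    customer?: { id: string };
}

function getCustomerId(doc: OrderDocument): string {
    if (doc.customer !== undefined) {
        return doc.customer.id;
    }
    if (doc.customerId !== undefined) {
        return doc.customerId;
    }
    throw new Error(`Document with id ${doc.id} does not contain customer ID in recognised format.`);
}

// Imitates a query that projects each document's id and the UDF result as customerId.
function project(docs: OrderDocument[]): { id: string; customerId: string }[] {
    return docs.map(c => ({ id: c.id, customerId: getCustomerId(c) }));
}

// Hypothetical sample documents: one in the old schema, one in the new.
const sampleDocs: OrderDocument[] = [
    { id: "ORD1", customerId: "123" },
    { id: "ORD2", customer: { id: "456" } },
];
const projected = project(sampleDocs);

// Imitates passing the string '123' to the UDF. Types are erased at runtime, so the
// string gets through, has no id or customer fields, and triggers the error.
function invalidInputMessage(): string {
    try {
        getCustomerId("123" as unknown as OrderDocument);
        return "no error";
    } catch (e) {
        return e instanceof Error ? e.message : String(e);
    }
}
```

This also shows why the error message mentions "id undefined": a plain string has no id property, so the template string prints undefined in its place.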


UDFs provide us with a way to encapsulate simple computational logic, and to expose this within queries. Although we can't refer to other documents or external data sources, UDFs are a good way to expose certain types of custom business logic. In this post, we've created a simple UDF for Cosmos DB, and tested it using a couple of simple documents. In the next part of this series we'll move on to stored procedures, which allow for considerably more complexity in our server-side code.

Key Takeaways

  • UDFs are intended for simple computation.

  • They can be used within queries, including in the SELECT and WHERE clauses.

  • UDFs cannot access other documents in the Cosmos DB collection beyond the data passed in as parameters, nor can they access any external resources.

  • They can accept zero or more parameters, which must be provided when calling the UDF.

  • The name of the JavaScript function does not have to match the name of the UDF, but to keep your sanity, I highly recommend keeping them consistent.

  • UDFs can be invoked from within a query by using the udf. prefix, e.g. SELECT udf.getCustomerId(c) FROM c.

  • You can view the code for this post on GitHub.