HowTo: Leverage Azure Cosmos DB metrics via Azure Monitor and analytics logs to find issues

Govind Kanshi
Sep 4, 2017 · 15 min read
  • This article will help you understand Cosmos DB metrics, especially Storage and Throughput, on the Cosmos DB portal.
  • We will find out how to tackle errors like "Storage quota for 'Document' exceeded" or "Request rate too large".
  • Reference for Partitioning
  • Reference for Throughput is here and here (thanks, Thomas Weiss).
  • Reference for Indexing is here.
  • Reference for the financially backed SLAs for throughput, latency, availability and consistency is here.
  • The big picture of the Cosmos DB platform is here.

Most of us can't imagine life ending soon. Scott Hanselman's "keystrokes left" site makes the end clearer in terms we understand: how many keystrokes you have left, and what you want to do with the rest of them. Getting more time to focus on the important things is a big help any day.

In the IT industry, database folks spend a lot of their time managing hardware and software details and learning every new trick to ensure end users get good SLAs for performance, throughput and availability. This requires them to learn and leverage expertise across clustering software, monitoring/management software and all the gory implementation details of a particular data store. Never be surprised if a DBA or developer tells you exactly how a disk needs to be formatted, how blocks are laid out, and what JVM or memory/disk flush settings need to be tuned just to get "started". In the majority of cases database folks know a lot more about how things work: indexes, statistics, execution plans, buffers, caches, disk issues. The other big keystroke gobbler is capacity management and forecasting for database workloads.

With Cosmos DB, the intent of the designers is to help developers and DBAs focus on their domain, while the service brings SLAs for low latency, throughput, consistency and availability, all with powerful global distribution available on tap.

It also makes it easier for developers to come in from the direction they are comfortable with: treat the data store as a document store (with the option of the MongoDB API), a key-value store, or a graph, with APIs that make sense for each. To make evolution of the data model easier, the system does not impose any specific schema. To take away the index management issues related to performance, it is designed as an auto-indexing system with a low-latency guarantee of durable commits and a consistent index.

To ensure applications get low-latency access and availability, the system is designed to be globally distributable at the click of a button or through a script. The client SDK is intelligent enough to auto-home to the "right" location every time in the event of issues. To provide easy capacity management, it lets you define the Throughput you need, then guarantees that Throughput while providing elasticity to change it as the needs of the workload change.

All these features add up to a lot of keystrokes saved at the end of the day.

Metrics

Cosmos DB provides server-side metrics on the portal which cover the SLAs. These metrics surface on the Cosmos DB portal and in Azure Monitor, and can be pushed via diagnostic logs to Storage, Event Hub or Azure Monitor's Log Analytics.

Application Insights and Log Analytics

Information about throughput usage and query execution statistics is shared back to the client. You can take the Request Charge, which is returned for every operation, push it to Application Insights and get instant alerts for extra RUs or time spent at the backend. You can also enable a diagnostics push to Log Analytics, which collects this data for your analysis. Both systems provide a good way to query the data to pinpoint issues. Some of the information you can stuff into AppInsights from the client side includes the full query text (which is not available in Log Analytics today) or the AD identity of the execution principal. Log Analytics logs every interaction that hits the database irrespective of origin; this means Log Analytics will also capture interactions started from the portal, for example. Azure Monitor, in turn, can provide metrics for the total requests which make up the operations.
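
As a quick illustration, here is a minimal sketch of pushing the Request Charge and client-observed latency of one operation into Application Insights. It assumes the Application Insights SDK's TelemetryClient and the DocumentDB .NET SDK (the client and myDocument variables used in the later snippets); the metric names are made up for illustration.

// Minimal sketch: push per-operation Request Charge and latency to Application Insights.
// Assumes the Microsoft.ApplicationInsights and Microsoft.Azure.Documents.Client namespaces;
// "CosmosDB RU" and "CosmosDB latency (ms)" are illustrative metric names.
TelemetryClient telemetry = new TelemetryClient();
Stopwatch watch = Stopwatch.StartNew();
ResourceResponse<Document> response = await client.CreateDocumentAsync(
    UriFactory.CreateDocumentCollectionUri("db", "coll"), myDocument);
watch.Stop();
telemetry.TrackMetric("CosmosDB RU", response.RequestCharge);
telemetry.TrackMetric("CosmosDB latency (ms)", watch.ElapsedMilliseconds);

You could then alert in Application Insights when either metric crosses a threshold you care about.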

SQL API Monitoring

Background

In general with Azure Cosmos DB two things matter — distribution of data and Throughput.

Throughput is elastic and practically unlimited; it can scale up or down to meet the needs of the workload. Every operation against a container consumes a portion of this throughput. This consumption is returned back as the Request Charge, as explained in the documentation.

Distribution of data depends on the chosen partition key, and is thus an important piece in ensuring data can scale out while serving the requisite queries or ingesting data.

Data storage and Throughput are practically unlimited. If you need more than what is available on the portal, reach out to us via support or askcosmosdb@microsoft.com.

Portal metrics

Today we will take a look at how to interpret these metrics as shared in the browser. They are collected every 5 minutes and keep 7 days of history.

Storage

The Storage metrics pane provides you 3 pieces of information: the number of physical partitions in a container, the total size occupied by data and index in a container, and the total number of "entities" in a container.

High level — view of storage pane metrics

1st: notice the number of physical partitions and their size.

Cosmos DB provides the storage you require and increases it when your application needs it, freeing you from the chore of capacity management.

Total number of physical partitions and their individual size

2nd: how much total space your container occupies, including both data and index. The index is generally a very small percentage of the actual data.

Index space can be controlled by looking at the data and the queries: if you never query on a certain attribute, you can exclude it from indexing and save index space. By default, all data/attributes are indexed by Cosmos DB today with a performance guarantee. Index alteration is an online process; your application continues operating without interruption. A change of schema, driven by evolving application needs, is therefore a non-event for this indexing mechanism. A sketch of excluding a path from the indexing policy follows below.
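
As a minimal sketch of that knob (assuming the DocumentDB .NET SDK; "/rawPayload/*" is an illustrative path for an attribute you never query on):

// Exclude an attribute from indexing to save index space; the change is applied online.
// "/rawPayload/*" is an illustrative path, not a real attribute from this article.
DocumentCollection collection = (await client.ReadDocumentCollectionAsync(
    UriFactory.CreateDocumentCollectionUri("db", "coll"))).Resource;
collection.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/rawPayload/*" });
await client.ReplaceDocumentCollectionAsync(collection);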

Below is another way to get similar information from the code.

// Measure the document size usage (which includes the index size).
ResourceResponse<DocumentCollection> collectionInfo = await client.ReadDocumentCollectionAsync(
    UriFactory.CreateDocumentCollectionUri("db", "coll"));
Console.WriteLine("Document size quota: {0}, usage: {1}",
    collectionInfo.DocumentQuota, collectionInfo.DocumentUsage);

3rd: how much data does each physical partition have? Which top 10 keys dominate the physical partition?

Amount of storage per physical partition and dominant keys in given partition

The first thing to notice is a fairly good distribution, ranging from 4.4 GiB to 6.5 GiB. When you click on a physical partition you will see its top partition keys.

At times you will see no dominant keys in a partition. Do not worry: that just shows good cardinality of the partition key, where no single key dominates the distribution. At other times you will see only a few keys and their sizes, which tells you those keys dominate the other keys in that physical partition in terms of logical size.

What would be a bad distribution?

Any container where the distribution across all partitions does not look roughly equal (say, within +/- 10% variance) requires investigation. A very irregular or highly skewed data distribution points to a bad partition key. It is essential to look at the partition key and the kind of data which is coming in, and it is always better to verify that the loaded data distribution follows your expectation.

Example: a customer who chose a date as the partition key and only has data for a few dates (maybe they have just started) would have very skewed data, and in the worst case all queries/ingestion would happen against that one partition key.

Possible Bad distribution of data
Bad distribution of data

Impact: you might see highly skewed throughput requests on one particular partition. You can try to correlate this on the Throughput metrics pane with Max consumed RU/second for each partition. If a partition is hot, in terms of all data ending up in one place and queries also hitting it, you have a potential issue to fix.

As you can see below, the consumed per-partition RU/s for the 1st partition is pretty high compared to the provisioned throughput. This will result in throttling and show up as latency (if the SDK retry defaults are not overridden).

Skewed throughput load on particular partition

4th: how many documents exist in each physical partition?

This data is present in the last pane; you should see a correlation between this and the second pane. Here again you will see the dominant keys.

Total number of entities per physical partition

What to take away: this pane provides you the individual entity count per physical partition and the ability to look up the dominant partition keys. This is important, as you could have a few large entities vs. a lot of small entities in the container. In general the same partition keys should turn up for both size and count.

When do I get the error "Storage quota for 'Document' exceeded"?

You will get this error if a physical partition is full, which is possible when the entities for a single logical partition key exceed the size of the physical partition. This indicates an issue with the partition key. To understand how it can happen: say you want to store voter information and you choose gender as the partition key. You start inserting data and you can end up in a situation where the data for either gender (assuming every entity representing a voter is 1 KB) exceeds about 10 M items, or 10 GB of data. This implies you need to look at your queries, redesign the partition key, and re-load the data. A sketch of creating a container with a higher-cardinality partition key follows below.
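
As a minimal sketch of the redesign step (assuming the DocumentDB .NET SDK; the collection name "voters", the key path "/voterId" and the throughput value are illustrative choices, not recommendations from this article):

// Create a collection partitioned on a high-cardinality attribute instead of gender.
// "/voterId" is an illustrative partition key path; pick one that matches your queries.
DocumentCollection voters = new DocumentCollection { Id = "voters" };
voters.PartitionKey.Paths.Add("/voterId");
await client.CreateDocumentCollectionAsync(
    UriFactory.CreateDatabaseUri("db"),
    voters,
    new RequestOptions { OfferThroughput = 10000 });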

Reference — Partitioning

Throughput

Throughput has been described well here and here (thanks, Thomas Weiss).

The key part of monitoring throughput is to see if you are getting throttled, then check whether consumption is more than provisioned, and dig into the actual time window to see if it was related to one partition or many. This is usually followed by looking at client-side telemetry (today) of RU charges for operations to find the culprit operations/queries.

So what do you get to see on the Throughput pane?

Throughput metrics in one view

1st, look at the top line, which might have details like those below. The important line is the latter one, which tells you how much throughput each partition gets.

Equal Throughput distribution across all partitions

Then you should focus on the number of requests per minute.

number of requests per minute

Here you see various HTTP status codes; look for light blue for the total number of successful requests. Most of the HTTP status codes are described here.

2nd, focus on the number of throttled operations per minute.

Number of throttled requests per minute

This tells you that you have a problem which needs resolution. So you need to get an idea of whether Max consumed RU/second actually shows the issue. In this case, very clearly, the provisioned throughput is very small (around 1.5 K) and actual usage touches up to 15 K RU.

If you want finer details of this metric, you can click on the double arrow.

get-more details
Details of Max RU/second

3rd, we need to see whether something specific happened in one partition or in multiple partitions at a particular time. You can hover over the metric above to get the time when we hit the peak of about 15 K; it looks like 3.05 PM. So let us enter that time in "Max consumed RU/second by each physical partition" and hit apply.

Single Partition seeing heavy usage at 3.05 PM

There is an obvious problem here. One partition is consuming a lot of RUs and the others are silent. This indicates an issue with the workload, with data skew, or both. With Stream Analytics + Event Hub performance testing we have often seen data skew and workload skew result in this kind of behavior. The fix here is to emulate the real workload and fix the data skew.

In general, in the Throughput pane you will see either a skewed workload, where most of the work happens on one partition, or all partitions under-provisioned. In either case you can get throttled and see the error "Request rate too large".

When do I get the error "Request rate too large"?

When a throttle occurs, the server will preemptively end the request with RequestRateTooLargeException (HTTP status code 429) and return the x-ms-retry-after-ms header indicating the amount of time, in milliseconds, that the user must wait before reattempting the request.

HTTP Status 429
Status Line: RequestRateTooLarge
x-ms-retry-after-ms :100

If you are using the SDK, then most of the time you never have to deal with this exception, as the current version of the Client SDK implicitly catches this response, respects the server-specified retry-after header, and retries the request. Unless your account is being accessed concurrently by multiple clients, the next retry will succeed.

If you have more than one client simultaneously operating above the request rate, the default retry behavior may not suffice, and the client will throw a DocumentClientException with status code 429 to the application. In cases such as this, you may consider handling retry behavior and logic in your application’s error handling routines or increasing the reserved throughput for the container.
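
As a minimal sketch (assuming the DocumentDB .NET SDK; the retry limits, the endpoint/authKey/myDocument variables and the fallback are illustrative), raising the SDK retry defaults and handling a 429 that still surfaces could look like this:

// Raise the SDK's automatic retry limits for throttled (429) requests.
ConnectionPolicy policy = new ConnectionPolicy();
policy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;   // illustrative value
policy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;            // illustrative value
DocumentClient client = new DocumentClient(new Uri(endpoint), authKey, policy);

try
{
    await client.CreateDocumentAsync(
        UriFactory.CreateDocumentCollectionUri("db", "coll"), myDocument);
}
catch (DocumentClientException ex) when (ex.StatusCode == (HttpStatusCode)429)
{
    // The SDK exhausted its retries; honor the server-suggested delay before retrying,
    // or hand the operation to your own error handling / queue.
    await Task.Delay(ex.RetryAfter);
}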

Reference for Throughput planning and handling issues.

Examples

The example below shows more patterns and details of the issue. Sometimes it helps to look at the data from a 7-day or 24-hour perspective rather than the default of 1 hour.

This was a condition where you can see the periodicity of "some repeated job", which results in heavily throttled requests on Aug 17. In this case the provisioned throughput had been reduced, which was the culprit resulting in heavy throttling on that particular day. Most of the time customers increase/decrease the throughput around their processing; in this case they did not increase the throughput when it was needed.

Yet another example is this one, where the customer needs to increase throughput as there is no set pattern.

Another example of more throughput required as all partitions are used

In this case all partitions are using more throughput than provisioned. Here we tried to see what was happening at 10.19 am. You can see a slight skew, but it is not significant. What we need to understand is why we got the spike.

So help comes in the form of query statistics and from examining request charges for all operations on the client side.

Client Side metrics

The other thing you need to do is look at the request charge of each operation and keep it in some form of telemetry (Azure AppInsights) to find out which operations contributed to that spike in usage. Sometimes it is difficult to isolate one operation, as you might be doing many of those operations from different client machines, so again it is important for you to build telemetry.

// Measure the performance (request units) of writes.
ResourceResponse<Document> response = await client.CreateDocumentAsync(
    UriFactory.CreateDocumentCollectionUri("db", "coll"), myDocument);
Console.WriteLine("Insert of document consumed {0} request units", response.RequestCharge);

// Measure the performance (request units) of queries.
IDocumentQuery<dynamic> queryable = client.CreateDocumentQuery(
    UriFactory.CreateDocumentCollectionUri("db", "coll"), queryString).AsDocumentQuery();
double totalRequestCharge = 0;
while (queryable.HasMoreResults)
{
    FeedResponse<dynamic> queryResponse = await queryable.ExecuteNextAsync<dynamic>();
    Console.WriteLine("Query batch consumed {0} request units", queryResponse.RequestCharge);
    totalRequestCharge += queryResponse.RequestCharge;
}
Console.WriteLine("Query consumed {0} request units in total", totalRequestCharge);

An example of looking at the Request Charge of all operations is here.

The Request Charge for an operation tells you how many of those operations can be done per second. You already know the possible Throughput per partition, so you can compute a theoretical limit for operations per second.

For example

Let us say you have a single partition with 10,000 RU/s and a single query operation taking up 100 RUs. So how many of these queries can execute per second? 10000/100 = 100 operations.

So now you have an idea of how to do a better job with your throughput and operations. You can get an initial estimate for CRUD operations with the help of the capacity estimator tool. Once you have good enough data, and your queries laid out, you can see what could be possible every second.

Total Throughput per second = (Read + Update + Delete + Create + Query operations per second) multiplied by their respective RU charges.
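
As a minimal sketch of that arithmetic (the per-operation charges and request rates below are illustrative numbers, not measurements from this article):

// Back-of-the-envelope budget: sum of (RU charge per operation * operations per second).
// All charges and rates here are illustrative; measure your own via RequestCharge.
double readsPerSecond = 500, readCharge = 1;     // point reads
double writesPerSecond = 100, writeCharge = 10;  // document creates
double queriesPerSecond = 20, queryCharge = 100; // cross-partition queries
double requiredRUs = readsPerSecond * readCharge
                   + writesPerSecond * writeCharge
                   + queriesPerSecond * queryCharge;
Console.WriteLine("Provision at least {0} RU/s", requiredRUs); // 500 + 1000 + 2000 = 3500 RU/s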

Reference for capacity estimation

Client Query statistics: Cosmos DB provides query execution statistics (focused today on the DocumentDB API, to be extended to the other APIs in the form that makes sense for each API).

IDocumentQuery<dynamic> query = client.CreateDocumentQuery(
    UriFactory.CreateDocumentCollectionUri(DatabaseName, CollectionName),
    "SELECT * FROM c WHERE c.city = 'Seattle'",
    new FeedOptions
    {
        PopulateQueryMetrics = true,
        MaxItemCount = -1,
        MaxDegreeOfParallelism = -1,
        EnableCrossPartitionQuery = true
    }).AsDocumentQuery();
FeedResponse<dynamic> result = await query.ExecuteNextAsync();

// Returns metrics by partition key range Id.
IReadOnlyDictionary<string, QueryMetrics> metrics = result.QueryMetrics;

You need to execute queries with the feed option for query statistics enabled; this tells you where the system is spending time, whether the index is being used, and whether we are retrieving a lot of documents versus actually returning them. So what can you do here? When you find issues, and generally if scans seem to be happening, provide the partition key or a better filter condition. A sketch of dumping these metrics follows below.
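
As a minimal sketch of dumping those metrics (using the metrics dictionary returned by the query above; the ToString() output should give a readable summary of counters such as retrieved vs. output documents and where time went):

// Print the query execution statistics returned for each partition key range.
foreach (KeyValuePair<string, QueryMetrics> entry in metrics)
{
    Console.WriteLine("Partition key range {0}:", entry.Key);
    Console.WriteLine(entry.Value.ToString());
}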

So once you know the Request Charge and where time is spent, you can take the requisite action to modify the index or the query.

Client Diagnostics are important when you need details, or when the support/product team asks you to capture them for debugging. This is over and above the ActivityId, Exception, RequestCharge, RequestLatency and StatusCode.

Client Diagnostics (.NET, Java) can be retrieved on demand for the direct mode of the SQL API (as of May 2019).
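
As a minimal sketch (assuming a recent 2.x DocumentDB .NET SDK that exposes a request diagnostics string on responses in direct mode; "docId" and "pkValue" are illustrative values, so check your SDK version before relying on this), capturing those fields for a single read could look like this:

// Capture per-request diagnostics (direct mode) alongside the usual telemetry fields.
ResourceResponse<Document> response = await client.ReadDocumentAsync(
    UriFactory.CreateDocumentUri("db", "coll", "docId"),
    new RequestOptions { PartitionKey = new PartitionKey("pkValue") });
Console.WriteLine("ActivityId: {0}, RU: {1}, latency: {2}",
    response.ActivityId, response.RequestCharge, response.RequestLatency);
Console.WriteLine(response.RequestDiagnosticsString); // detailed diagnostics for support/debugging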

Azure Monitor Metrics

Azure Monitor metrics provide a way to get the distribution of your operations across all requests. This helps you quickly get an idea of what kind of requests are coming in and what you might need to do, such as increasing throughput.

Operation Types for requests

You can leverage Azure Monitor metrics to create alerts and dashboards (look at the workbooks feature to monitor multiple Cosmos DB collections/accounts).

Azure Monitor Log Analytics

Cosmos DB provides an immutable record of all interactions (CRUD on the data plane and control plane activities) via Log Analytics. All diagnostic information can be pushed to Blob storage, Event Hub or Log Analytics. The latter is what you can learn about here.

Log Analytics provides an easy way of finding out:

  • the volume of operations coming in within a given time window
  • operations which take more RUs, or
  • operations which take more time

So how can you leverage it?

First, enable Log Analytics for data plane requests and activity. Then use queries which help you get the information you are after.

AzureActivity
| where Caller == "x@x.com"
| summarize count() by Resource
// which operations are happening
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
| summarize count() by OperationName
// which operations exceed a duration threshold (here duration_s > 3000)
AzureDiagnostics
| where toint(duration_s) > 3000 and ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
| summarize count() by clientIpAddress_s, TimeGenerated
// when long running operations ran; be careful, these can pull in a lot of data,
// so put a time boundary around them
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
| project TimeGenerated, toint(duration_s)/2000
| render timechart
// which operations cost more than x RUs
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
    and TimeGenerated between (datetime(2018-05-07) .. datetime(2018-05-09))
    and toreal(requestCharge_s) > 2.0
| project Resource, collectionRid_s, duration_s, OperationName, ResourceType
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
    and toreal(requestCharge_s) > 5
    and OperationName == "Query"

If the expensive operations are writes, consider whether you really filter on all the attributes in your queries; you can decide to drop the index on selected attributes (see the indexing policy sketch earlier).

To summarize

  1. Verify on the Storage pane that your data is well distributed across the partitions.
  2. Verify that throttles are not happening with the help of the Throughput pane.
  3. If throttles are present, verify whether they occur across all partitions or on a specific partition.
  4. Use client-side metrics of Request Charge and query statistics to find out where you spend RUs/time; always focus on the dominant queries/operations that make up 80% of the workload. Either increase the RUs if needed or modify the query.
  5. Use Log Analytics for deeper, pinpointed investigation of issues, focusing on queries or other operations.

In one of the next articles we will focus on looking at query execution statistics and finding alternatives.

If you ever need help with Cosmos DB, please reach out to us at askcosmosdb@microsoft.com.

